1. Sequential decision making: decidability and complexity. Games with partial observation. Olivier.Teytaud@inria.fr + too many people to be all cited. Includes Inria, CNRS, Univ. Paris-Sud, LRI, Taiwan universities (including NUTN), CITINES project. TAO, Inria-Saclay IDF, CNRS 8623, LRI, Univ. Paris-Sud, Digiteo Labs, Pascal Network of Excellence. Paris, September 2012.
2. A quite general model A directed graph (finite). A starting point on the graph, a target (or several targets, with different rewards). I want to reach a target. Labels (=decisions) on edges: next node = f(current node, decision). Each node is either: - a random node (random decision) - a decision node (I choose a decision) - an opponent node (an opponent chooses)
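The model above is easy to write down concretely. Below is a minimal sketch in Python; the type names, field names, and the tiny example instance are all illustrative, not taken from the talk:

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    RANDOM = "random"      # successor drawn at random
    PLAYER = "player"      # we choose the decision
    OPPONENT = "opponent"  # the opponent chooses the decision

@dataclass
class Node:
    kind: NodeType
    # decision label -> successor node id (next node = f(current node, decision))
    edges: dict = field(default_factory=dict)
    reward: float = 0.0    # nonzero only at target nodes

# a tiny instance: start at node 0, target is node 2
graph = {
    0: Node(NodeType.PLAYER, {"left": 1, "right": 2}),
    1: Node(NodeType.OPPONENT, {"a": 0, "b": 2}),
    2: Node(NodeType.PLAYER, {}, reward=1.0),
}
```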
3. Partial observation Each decision node is equipped with an observation; you can make decisions using the list of past observations ==> you don't know where you are in the graph
4. Overview ● 10%: overview of Alternating Turing machines & computational complexity (great tool for complexity upper bounds) ● 50%: general culture on games (including undecidability) ● 35%: general culture on fictitious play (matrix games) (probably no time for this...) ● 4%: my results on that stuff ==> 2 detailed proofs (one new) ==> feel free to interrupt
5. Outline● Complexity and ATM● Complexity and games (incl. planning)● Bounded horizon games
10. Complexity and alternating Turing machines ● Turing machine (TM) = abstract computer ● Non-deterministic Turing Machine (NTM) = TM with "exists" states (i.e. several transitions; accepts if at least one accepts) ● Co-NTM: TM with "for all" states (i.e. several transitions; accepts if all lead to accept) ● ATM: TM with both "exists" and "for all" states.
11. Alternation (diagram)
12. Non-determinism & alternation (diagram)
13. Outline● Complexity and ATM● Complexity and games (incl. planning)● Bounded horizon games
14. Computational complexity: framework Uncertainty can be: – Adversarial: I focus on the worst case – Stochastic: I focus on the average result – Or both. "Stochastic = adversarial" if the goal is 100% success. "Stochastic != adversarial" in the general case.
15. Computational complexity: framework Many representations for problems. E.g.: – Succinct: a circuit computes the i-th bit of the probability that action a leads to a transition from s to s' – Compressed: a circuit computes many bits simultaneously – Flat: longer encoding (transition tables) ==> does not matter for decidability ==> matters for complexity
16. Computational complexity: framework Many representations for problems. E.g.: – Succinct – Compressed – Flat The compressed representation is "somehow" natural (the state space has exponential size, transitions are fast): see e.g. Mundhenk for detailed definitions and flat representations.
17. Computational complexity: framework We mainly use the compressed representation; see also Mundhenk for flat representations. Typically, exponentially smaller representations lead to exponentially higher complexity ==> but it's not always the case... Simple things can change the complexity a lot: "superko": rules forbid repeating the same position; some fully observable 2-player games become EXPSPACE instead of EXP ==> discussed later
18. Computational complexity: framework for first tables of results Either search (find a target) or optimize (cumulate rewards over time) Compressed (written with circuits or others...) or not (flat). Horizon: - Short horizon: horizon ≤ size of input - Long horizon: log2(horizon) ≤ size of input - Infinite horizon: no limit
20. Mundhenk's summary: one player, non-negative rewards, looking for a non-negative average reward (= positive probability of reaching): easier
21. Complexity, partial observation, infinite horizon, probability of reaching a target ● 1P+random, unobservable: undecidable (Madani et al) ● 1P+random, P(win)=1, or equivalently 2P, P(win)=1: [Rintanen and refs therein] – Fully observable: EXP [Littman 94] – Unobservable: EXPSPACE [Haslum et al 2000] – Partial observability: 2EXP [Rintanen, 2003] Remark: "2P, P(win)=1" is not "2P"!
22. Complexity, partial observation, infinite horizon ● 2P vs 1P, P(win)=1?: undecidable! [Hearn, Demaine] ● 2P (random or not): – Existence of a sure win: equivalent to 1P+random! ● EXP fully observable (e.g. Go, Robson 1984) ● PSPACE unobservable ● 2EXP partially observable – Existence of a sure win, same state forbidden: EXPSPACE-complete (Go with Chinese rules? rather conjectured EXPTIME or PSPACE...) – General case (optimal play): undecidable (Auger, Teytaud) (what about phantom-Go?)
23. Complexity, partial observation Remarks: ● Continuous case? ● Purely epistemic (we gather information, we don't change the state)? [Sabbadin et al] ● Restrictions on the policy, on the set of actions... ● Discounted reward ● DEC-POMDP, POSG: many players, same/opposite/different reward functions...
24. What are the approaches? – Dynamic programming (Massé – Bellman, 50s) (still the main approach in industry), alpha-beta, retrograde analysis – Reinforcement learning – MCTS (R. Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. In Proceedings of the 5th International Conference on Computers and Games, Turin, Italy, 2006) – Scripts + tuning / direct policy search – Coevolution All have their PO extensions, but the last two are the most convenient in this case.
25. Partially observable games Many tools for fully observable games. Not so many for partially observable ones. ● Shi-Fu-Mi (Rock-Paper-Scissors) ● Card games ● Phantom games
26. Shi-Fu-Mi (Rock-Paper-Scissors) ● Fully observable in simultaneous play, but partially observable in the turn-based version. ● Computers are stronger than humans (yes, it's true).
27. Card games, phantom games ● Phantomized version of a game: – You don't see the moves of your opponents – If you play an illegal move, you are informed that it's illegal, and you play again – Usually, you get a bit more information (captures, threats...) <== game-dependent ● Phantom games: – phantom-Chess = Kriegspiel ==> Dark Chess: more info – phantom-Go – etc.
28. Partially observable games ● Usually quite heuristic algorithms ● The best performing algorithms combine: – Opponent modelling (as for Shi-Fu-Mi) – Belief state (often by Monte-Carlo simulations) – Not a lot of tree search – A lot of tuning ==> usually no consistency analysis
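As a toy illustration of the opponent-modelling ingredient, here is a sketch of a frequency-based responder for Rock-Paper-Scissors. The class and names are hypothetical; a real engine would model much richer patterns than raw move frequencies:

```python
import random
from collections import Counter

# which move beats which: BEATS[x] is the move that beats x
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class FrequencyModel:
    """Predict the opponent's next move as their most frequent past move,
    then play the move that beats that prediction."""
    def __init__(self):
        self.history = Counter()

    def observe(self, opponent_move):
        self.history[opponent_move] += 1

    def respond(self):
        if not self.history:
            return random.choice(list(BEATS))  # no data yet: play uniformly
        predicted, _ = self.history.most_common(1)[0]
        return BEATS[predicted]

model = FrequencyModel()
for move in ["rock", "rock", "paper", "rock"]:
    model.observe(move)
# "rock" is the most frequent observation, so the model answers "paper"
```

Note that such a deterministic responder is itself exploitable, which is why strong programs mix opponent modelling with randomization.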
29. Part I: Complexity analysis(unbounded horizon) – Game: ● One or two players ● Win, loss, draw (incl. endless loop) – Partial observability, no random part – Finite state space: ● state=transition(state,action) ● action decided by each player in turn
30. State of the art - makes sense in fully observable games - not so much in non-observable games
31. State of the art EXPTIME-complete in the general fully-observable case
32. EXPTIME-complete fully observable games - Chess (for some nxn generalization) - Go (with no superko) - Draughts (international or English) - Chinese checkers - Shogi
33. PSPACE-complete fully observable games - Amazons - Hex - Go-moku - Connect-6 - Qubic - Reversi - Tic-Tac-Toe Polynomial horizon + full observation ==> PSPACE. Many games where each cell is filled once and only once.
34. EXPSPACE-complete unobservable games (Haslum & Jonsson) The two-player unobservable case is EXPSPACE-complete (games in succinct form, infinite horizon). (Still for the 100%-win "UD" criterion - for not-fully-observable cases it is necessary to be precise...) Importantly, the UD criterion means that strategies are the same whether the opponent has full observation or no observation ==> UD is very bad :-(
35. EXPSPACE-complete unobservable games (Haslum & Jonsson) The two-player unobservable case is EXPSPACE-complete (games in succinct form). PROOF: (I) First note that strategies are just sequences of actions (no observability!) (II) It is in EXPSPACE = NEXPSPACE, because of the following algorithm: (a) Non-deterministically choose the sequence of actions (an exponential list of actions is enough...) (b) Check the result against all possible strategies (III) Only hardness remains to be checked.
38. EXPSPACE-complete unobservable games (Haslum & Jonsson) The two-player unobservable case is EXPSPACE-complete (games in succinct form). PROOF of the hardness: reduction from: is my TM with an exponential tape going to halt? Consider a TM with a tape of size N=2^n. We must build a game - of size n (n = log2(N)) - such that player 1 has a winning strategy iff the TM halts.
39. EXPSPACE-complete unobservable games (Haslum & Jonsson): encoding a Turing machine with a tape of size N as a game with state O(log(N)). Player 1 chooses the sequence of configurations of the tape (N=4):
x(0,1), x(0,2), x(0,3), x(0,4) ==> initial state
x(1,1), x(1,2), x(1,3), x(1,4)
x(2,1), x(2,2), x(2,3), x(2,4)
x(3,1), x(3,2), x(3,3), x(3,4)
.....................................
x(N,1), x(N,2), x(N,3), x(N,4) ==> wins by final state!
Except if P2 finds an illegal transition! ==> P2 can check the consistency of one 3-uple per line ==> requires space log(N) (= position of the 3-uple)
42. EXPSPACE-complete unobservable games The 1P + unknown initial state unobservable case is EXPSPACE-complete (games in succinct form). 2P + unobservable as well.
43. 2EXPTIME-complete PO games The two-player PO case, or the 1P+random PO case, is 2EXP-complete (games in succinct form). (2P = 1P+random because of UD)
44. Undecidable games (R. Hearn) The three-player PO case is undecidable (two players against one, not allowed to communicate).
45. Hummm ? Do you know a PO game in which you can ensure a win with probability 1 ?
46. Another formalization ==> much more satisfactory (might have drawbacks as well...)
47. Madani et al. 1 player + random = undecidable (even without an opponent!)
48. Madani et al. 1 player + random = undecidable. ==> answers a (related) question by Papadimitriou and Tsitsiklis. Proof? Based on the emptiness problem for probabilistic finite automata (see Paz, 1971): given a probabilistic finite automaton, is there a word accepted with probability at least c? ==> undecidable
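For contrast: computing the acceptance probability of one *given* word by a probabilistic finite automaton is straightforward; the undecidability concerns the existential question over all words. A sketch with an illustrative 2-state automaton (the matrices and word are made up for the example):

```python
def acceptance_probability(transitions, initial, accepting, word):
    """transitions[letter] is a row-stochastic matrix; the state distribution
    (a row vector) is multiplied by that matrix for each letter of the word."""
    dist = list(initial)
    for letter in word:
        matrix = transitions[letter]
        dist = [sum(dist[i] * matrix[i][j] for i in range(len(dist)))
                for j in range(len(matrix[0]))]
    # total probability mass on accepting states after reading the word
    return sum(p for j, p in enumerate(dist) if j in accepting)

# a toy PFA over the alphabet {a, b}, starting in state 0, accepting in state 1
T = {
    "a": [[0.5, 0.5], [0.0, 1.0]],
    "b": [[1.0, 0.0], [0.5, 0.5]],
}
p = acceptance_probability(T, [1.0, 0.0], {1}, "ab")  # p = 0.25
```

The emptiness problem asks whether *some* word reaches acceptance probability at least c, and no algorithm can decide this in general.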
49. Consequence for unobservable games 1 player + random = undecidable ==> 2 players = undecidable.
50. Proof of "undecidability with 1 player against random" ==> "undecidability with 2 players" How to simulate 1 player + random with 2 players?
53. A random node (with N equiprobable successors t0, ..., t(N-1)) rewritten as follows: ● Player 1 chooses a in [[0,N-1]] ● Player 2 chooses b in [[0,N-1]] ● c = (a+b) modulo N ● Go to tc. Each player can force the game to be equivalent to the initial one (by playing uniformly) ==> the probability of winning for player 1 (in case of perfect play) is the same as for the initial game ==> undecidability!
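A quick sanity check of the key property of this gadget, in exact arithmetic (illustrative code): whatever move the other player fixes, uniform play makes c = (a+b) mod N uniform over the N successors.

```python
from fractions import Fraction

def c_distribution(N, a):
    """Distribution of c = (a + b) mod N when b is uniform on {0, ..., N-1}
    and a is fixed (the symmetric case, a uniform and b fixed, is identical)."""
    dist = [Fraction(0)] * N
    for b in range(N):
        dist[(a + b) % N] += Fraction(1, N)
    return dist

N = 5
# whatever player 1 plays, uniform play by player 2 makes c uniform:
for a in range(N):
    assert c_distribution(N, a) == [Fraction(1, N)] * N
```

This is exactly why neither player can gain anything over the original random node: each can unilaterally restore the original transition probabilities.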
54. Important remark The existence of a strategy winning with probability 0.5 is also undecidable for the restriction to games in which this probability is >0.6 or <0.4 ==> not just a subtle precision trouble.
55. So what? We have seen that unbounded horizon + partial observability + a natural criterion (not sure win) ==> undecidability, contrary to what is expected from the usual definitions. What about bounded horizon, 2P? – Clearly decidable – Complexity? – Algorithms? (==> coevolution & LP)
57. Part II: Fictitious play (bounded horizon) in the antagonist case Fictitious play? Somehow an abstract version of antagonist coevolution with full memory: ● unlimited population (finite, but increasing): one more individual per iteration ● perfect choice of each mutation against the current population of opponents
58. Part II: Fictitious play in the zero-sum case Why zero-sum cases? Evolutionarily stable solutions (found by FP) are usually sub-optimal otherwise (as in nature, when choosing lions' strategies or cheating behaviors in the Scaly-breasted Munia)
59. What is a matrix 0-sum game? ● A matrix M is given (type n x m). ● Player 1 chooses (privately) i in [[1,n]] ● Player 2 chooses j in [[1,m]] ● Reward = Mij for player 1 = -Mij for player 2 (zero-sum game) ==> model for finite antagonist games
60. Nash equilibrium ● Nash equilibrium: there is a probability distribution for each player (= mixed strategy) such that the reward is optimal (for the worst case over the opponent's probability distribution) ● Linear programming is a polynomial-time algorithm for finding the Nash equilibrium ● FP = a tool for approximating it (at least in 0-sum cases)
61. Fictitious play (Brown 1949) ● Each player starts with a distribution on its strategies ● Each player, in turn: – finds an optimal strategy against the opponent's current distribution (randomly breaking ties) – adds it to its distribution (the distribution does not sum to 1!)
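The procedure above can be sketched in a few lines of plain Python (an illustrative implementation, with counts kept unnormalized as on the slide; ties are broken deterministically here rather than randomly, which is enough for the zero-sum case):

```python
def fictitious_play(M, iterations=20000):
    """Fictitious play on a zero-sum matrix game M (row player maximizes).
    Each player, in turn, best-responds to the opponent's empirical counts."""
    n, m = len(M), len(M[0])
    row_counts = [0] * n   # how often each row strategy has been played
    col_counts = [0] * m
    row_counts[0] += 1     # arbitrary initial pure strategies
    col_counts[0] += 1
    for _ in range(iterations):
        # row player's best response to the column empirical distribution
        row_payoff = [sum(M[i][j] * col_counts[j] for j in range(m))
                      for i in range(n)]
        row_counts[max(range(n), key=row_payoff.__getitem__)] += 1
        # column player's best response (minimizing) to the row distribution
        col_payoff = [sum(M[i][j] * row_counts[i] for i in range(n))
                      for j in range(m)]
        col_counts[min(range(m), key=col_payoff.__getitem__)] += 1
    tr, tc = sum(row_counts), sum(col_counts)
    return ([c / tr for c in row_counts], [c / tc for c in col_counts])

# Rock-Paper-Scissors: the empirical mixtures approach (1/3, 1/3, 1/3)
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
x, y = fictitious_play(rps)
```

By Robinson's theorem, in zero-sum games the empirical frequencies converge to a Nash equilibrium, although slowly; here x and y end up close to the uniform mixture.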
69. Improvements for KxK matrix games: approximations ● There exist ε-approximations with support of size O(log(K)/ε²) [Althoefer] ● Such an approximation can be found in time O(K log(K)/ε²) [Grigoriadis et al.]: basically a stochastic FP ● Exact solution in time O(K log(K) · k·2^k + poly(k)) (Auger, Ruette, Teytaud) if the solution is k-sparse (good only if k is smaller than log(K)/log(log(K))! better?)
70. Improvements for KxK matrix games So, LP & FP are two tools for matrix games. LP can be adapted to PO games without building the complete matrix (using information sets). Can the same be done for FP variants?
71. Conclusions There are still natural questions which provide nice decidability problems. Madani et al. (1 player against random, no observability), extended here to 2 players with no randomness ==> undecidable problems "less than" the halting problem? Solving zero-sum matrix games is still an active area of research: ● approximate cases ● sparse case
72. Open problems ● Phantom-Go undecidable? (or another "real" game...) ● Complexity of Go with Chinese rules? (conjectured: PSPACE or EXPTIME; proved PSPACE-hard and in EXPSPACE) ● More to say about "epistemic" games (internal state not modified) ● Frontier of undecidability in PO games? (100%-halting games: 2P becomes decidable) ● Chess with finitely many pieces on an infinite board: decidability of forced mate? (n-move version: Brumleve et al., 2012, simulation in Presburger arithmetic (thanks S. Riis :-))