
Dynamic Programming and Reinforcement Learning applied to Tetris Game

Slides presented as coursework for the Artificial Intelligence class at IME-USP. This presentation is about how reinforcement learning is applied to the game of Tetris.

  1. 1. Dynamic Programming and Reinforcement Learning applied to Tetris game Suelen Goularte Carvalho Inteligência Artificial 2015
  2. 2. Tetris
  3. 3. Tetris ✓ Board of 20 x 10 cells ✓ 7 types of tetrominoes (pieces) ✓ Pieces move down, left, or right ✓ Pieces can be rotated
  4. 4. Tetris One-Piece Controller Player knows: ✓ board ✓ current piece.
  5. 5. Tetris Two-Piece Controller Player knows: ✓ board ✓ current piece ✓ next piece
  6. 6. Tetris Evaluation One-Piece Controller Two-Piece Controller
  7. 7. How many possibilities do we have just here?
  8. 8. Tetris indeed contains a huge number of board configurations, about 7.0 × 2^199 ≃ 5.6 × 10^60. Finding the strategy that maximizes the average score is an NP-Complete problem! — Building Controllers for Tetris, 2009
  9. 9. Tetris Complexity
  10. 10. Tetris is a problem of sequential decision making under uncertainty. In the context of dynamic programming and stochastic control, the most important object is the cost-to-go function, which evaluates the expected future cost from the current state. — Feature-Based Methods for Large Scale Dynamic Programming
  11. 11. Immediate reward vs. future reward. [Diagram: from state Si, six actions with immediate rewards from 1000 to 7000 and the future rewards they lead to (9000 and 13000); the action with the best immediate reward is not the one with the best future reward.]
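A small sketch in Python of the point behind this diagram, using its reward values; how each immediate reward pairs with a remaining future reward is an assumption made for illustration:

    # Hypothetical action values from state Si, echoing the diagram on slide 11.
    # Each action gives an immediate reward and leads to an assumed remaining future reward.
    actions = {
        "a1": (7000, 2000),   # best immediate reward, 9000 in total
        "a2": (5000, 8000),   # smaller immediate reward, 13000 in total
        "a3": (3000, 1000),
        "a4": (2500, 500),
        "a5": (1000, 200),
        "a6": (4000, 1500),
    }

    greedy = max(actions, key=lambda a: actions[a][0])        # picks "a1"
    farsighted = max(actions, key=lambda a: sum(actions[a]))  # picks "a2"
    print(greedy, sum(actions[greedy]))          # a1 9000
    print(farsighted, sum(actions[farsighted]))  # a2 13000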
  12. 12. With about 7.0 × 2^199 ≃ 5.6 × 10^60 states, it is essentially impossible to compute, or even store, the value of the cost-to-go function at every possible state. — Feature-Based Methods for Large Scale Dynamic Programming
  13. 13. Compact representations alleviate the computational time and space of dynamic programming, which otherwise employs an exhaustive look-up table, storing one value per state. — Feature-Based Methods for Large Scale Dynamic Programming. S = {s1, s2, …, sn} is mapped to V = {v1, v2, …, vm}, where m < n.
  14. 14. For example, if the state i represents the number of customers in a queueing system, a possible and often interesting feature f is defined by f(0) = 0 and f(i) = 1 if i > 0. Such a feature focuses on whether a queue is empty or not. — Feature-Based Methods for Large Scale Dynamic Programming
  15. 15. Feature-based method: S = {s1, s2, …, sn} is mapped to V = {v1, v2, …, vm}, where m < n. — Feature-Based Methods for Large Scale Dynamic Programming
  16. 16. Feature-based method. Features: ★ Height of the current wall. ★ Number of holes. For the 10 x 20 board, H = {0, ..., 20} and L = {0, ..., 200}, and the feature extraction is a map F : S → H x L. — Feature-Based Methods for Large Scale Dynamic Programming
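A minimal sketch of this feature extraction, assuming a state is represented as a 20 x 10 grid of 0/1 cells with row 0 at the top; the board representation and the function name are illustrative choices, not taken from the paper:

    # Feature extraction F : S -> H x L for a 20 x 10 Tetris board,
    # where H is the height of the current wall (0..20) and L is the number of holes (0..200).
    def extract_features(board):
        rows, cols = len(board), len(board[0])
        height, holes = 0, 0
        for c in range(cols):
            seen_filled = False
            for r in range(rows):
                if board[r][c]:
                    if not seen_filled:
                        height = max(height, rows - r)  # tallest column seen so far
                    seen_filled = True
                elif seen_filled:
                    holes += 1                          # empty cell below a filled one
        return height, holes

    board = [[0] * 10 for _ in range(20)]
    board[19] = [1] * 9 + [0]          # an almost-full bottom row
    board[18][0] = 1                   # one block resting on column 0
    print(extract_features(board))     # (2, 0)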
  17. 17. Using a feature-based evaluation function works better than just choosing the move that realizes the highest immediate reward. — Building Controllers for Tetris, 2009
  18. 18. Example of features — Building Controllers for Tetris, 2009
  19. 19. ...The problem of building a Tetris controller comes down to building a good evaluation function. Ideally, this function should return high values for the good decisions and low values for the bad ones. — Building Controllers for Tetris, 2009
  20. 20. In the Reinforcement Learning context, algorithms aim at tuning the weights such that the evaluation function approximates well the optimal expected future score from each state. — Building Controllers for Tetris, 2009
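A sketch of what such a weighted, feature-based evaluation function looks like; the feature names, weight values, and candidate moves below are placeholders rather than the tuned values from the paper:

    # Linear evaluation function: V(s) = sum_i w_i * f_i(s).
    # RL algorithms tune the weights w so that V approximates the expected future score.
    weights = {"height": -1.0, "holes": -4.0}   # placeholder weights

    def evaluate(features):
        return sum(weights[name] * value for name, value in features.items())

    def best_move(moves):
        # moves: {move: feature dict of the resulting board}; pick the highest-valued move
        return max(moves, key=lambda m: evaluate(moves[m]))

    candidate_moves = {
        "drop left":  {"height": 5, "holes": 1},   # value -9.0
        "drop right": {"height": 6, "holes": 0},   # value -6.0
    }
    print(best_move(candidate_moves))   # "drop right"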
  21. 21. Reinforcement Learning
  22. 22. Reinforcement Learning by The Big Bang Theory https://www.youtube.com/watch?v=tV7Zp2B_mt8&list=PLAF3D35931B692F5C
  23. 23. Reinforcement Learning. Imagine playing a new game whose rules you do not know; after roughly a hundred moves, your opponent announces: “You lost!”. In short, that is reinforcement learning.
  24. 24. Supervised Learning: map inputs to outputs (labels score well). input: 1 2 3 4 5 6 7 8 …; output: 1 4 9 16 25 36 49 64 …; y = f(x) → function approximation, here f(x) = x^2. https://www.youtube.com/watch?v=Ki2iHgKxRBo&list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo
  25. 25. Unsupervised Learning: f(x) → clusters (description); clusters score well. [Diagram: a scatter of x and o points grouped into separate clusters.]
  26. 26. Reinforcement Learning: behaviors score well. [Diagram: the agent-environment loop; the Agent sends an Action to the Environment, which returns a Reward and the next State.]
  27. 27. Reinforcement Learning ✓ Agents take actions in an environment and receive rewards ✓ Goal is to find the policy π that maximizes rewards ✓ Inspired by research into psychology and animal learning
  28. 28. Reinforcement Learning Model. Given: S, a set of states; A, a set of actions; T(s, a, s') ~ P(s' | s, a), the transition model; R, the reward function. Find: a policy π(s) = a that maximizes the expected future reward. [Diagram: state Si with immediate and future rewards, as on slide 11.]
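To make the Given/Find pairing concrete, here is a minimal value iteration sketch on a toy MDP; every state, action, transition probability, and reward below is invented for illustration and has nothing to do with Tetris:

    # Value iteration: V(s) <- max_a [ R(s, a) + gamma * sum_s' T(s, a, s') * V(s') ]
    gamma = 0.9
    S = ["s0", "s1"]
    A = ["stay", "go"]
    T = {  # T[(s, a)] = list of (next state, probability)
        ("s0", "stay"): [("s0", 1.0)],
        ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
        ("s1", "stay"): [("s1", 1.0)],
        ("s1", "go"):   [("s0", 1.0)],
    }
    R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 5.0, ("s1", "go"): 0.0}

    V = {s: 0.0 for s in S}
    for _ in range(100):
        V = {s: max(R[s, a] + gamma * sum(p * V[s2] for s2, p in T[s, a]) for a in A)
             for s in S}

    # Greedy policy with respect to V
    policy = {s: max(A, key=lambda a, s=s: R[s, a] + gamma * sum(p * V[s2] for s2, p in T[s, a]))
              for s in S}
    print(V)       # roughly {'s0': 45.1, 's1': 50.0}
    print(policy)  # {'s0': 'go', 's1': 'stay'}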
  29. 29. This requires substantial computation, processing, and memory.
  30. 30. Dynamic Programming
  31. 31. Dynamic Programming: solving problems by breaking them down into simpler subproblems, solving each subproblem just once, and storing its solution. https://en.wikipedia.org/wiki/Dynamic_programming
  32. 32. Supporting property: optimal substructure. [Diagram: the optimal path from A to G that passes through B consists of the optimal path from A to B followed by the optimal path from B to G.]
  33. 33. Fibonacci Sequence: 0 1 1 2 3 5 8 13 21. Each number is the sum of the two numbers before it.
  34. 34. Fibonacci Sequence. Recursive formula: f(n) = f(n-1) + f(n-2). n = 0 1 2 3 4 5 6 7 8; v = 0 1 1 2 3 5 8 13 21.
  35. 35. Fibonacci Sequence. n = 0 1 2 3 4 5 6 7 8; v = 0 1 1 2 3 5 8 13 21. Example: f(6) = f(6-1) + f(6-2) = f(5) + f(4) = 5 + 3 = 8.
  36. 36. Fibonacci Sequence - Normal computation: f(n) = f(n-1) + f(n-2). [Diagram: the full recursion tree for f(6), expanding every subproblem down to f(1) and f(0).]
  37. 37. Fibonacci Sequence - Normal computation: the naive recursion takes exponential time, O(2^n). [Diagram: the same recursion tree for f(6), highlighting repeated subtrees.]
  38. 38. 18 of 25 Nodes Are Repeated Calculations!
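A quick way to check this count is to instrument the naive recursion and tally its calls (a small sketch, not part of the slides):

    # Naive recursive Fibonacci, counting how many times each value is computed.
    from collections import Counter

    calls = Counter()

    def fib(n):
        calls[n] += 1
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    fib(6)
    print(sum(calls.values()))                # 25 calls in total
    print(sum(calls.values()) - len(calls))   # 18 of them recompute an already-solved subproblem
    print(calls)                              # f(2) alone is computed 5 times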
  39. 39. Fibonacci Sequence - Dynamic Programming. Dictionary m; m[0] = 0, m[1] = 1. integer fib(n): if m[n] == null then m[n] = fib(n-1) + fib(n-2); return m[n]
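The pseudocode above translates almost directly into runnable Python, with a dict playing the role of the dictionary m:

    # Top-down dynamic programming (memoization), mirroring the slide's pseudocode.
    m = {0: 0, 1: 1}          # Dictionary m; m[0] = 0, m[1] = 1

    def fib(n):
        if n not in m:        # "if m[n] == null"
            m[n] = fib(n - 1) + fib(n - 2)
        return m[n]

    print(fib(6))   # 8, with each subproblem solved only once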
  40. 40. Fibonacci Sequence - Dynamic Programming. [Table: index 0 1 2 3 4 5; value 0 1 _ _ _ _; the base cases are filled in first.]
  41. 41. Fibonacci Sequence - Dynamic Programming. [Table: value 0 1 1 _ _ _; f(2) = 1 + 0 = 1.]
  42. 42. Fibonacci Sequence - Dynamic Programming. [Table: value 0 1 1 2 _ _; f(3) = 1 + 1 = 2.]
  43. 43. Fibonacci Sequence - Dynamic Programming. [Table: value 0 1 1 2 3 _; f(4) = 2 + 1 = 3.]
  44. 44. Fibonacci Sequence - Dynamic Programming. [Table: value 0 1 1 2 3 5; f(5) = 3 + 2 = 5.] O(1) memory, O(n) running time.
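The bottom-up filling shown on slides 40 to 44 can also keep just the two most recent values, which is where the O(1) memory and O(n) running time figures come from (a small sketch):

    # Bottom-up Fibonacci keeping only the last two values: O(n) time, O(1) memory.
    def fib(n):
        a, b = 0, 1              # f(0), f(1)
        for _ in range(n):
            a, b = b, a + b      # the slide's steps: 1+0=1, 1+1=2, 2+1=3, 3+2=5, ...
        return a

    print([fib(i) for i in range(9)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21]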
  45. 45. Some scores over time… Tsitsiklis and van Roy (1996): 31, averaged over 100 games played. Bertsekas and Tsitsiklis (1996): 3,200, averaged over 100 games played. Kakade (2001): 6,800, though applied without specifying how many game scores are averaged. Farias and van Roy (2006): 4,700, averaged over 90 games played. — Building Controllers for Tetris, 2009
  46. 46. Current best! Dellacherie (Fahey, 2003), one-piece controller, tuned by hand: 660 thousand, averaged over 56 games played. Dellacherie (Fahey, 2003), two-piece controller with some original features of which the weights were tuned by hand: 7.2 million; only 1 game was played, and this took a week. — Building Controllers for Tetris, 2009
  47. 47. Experiment…
  48. 48. Experiment. An experienced human Tetris player would take about 3 minutes to eliminate 30 rows. — Feature-Based Methods for Large Scale Dynamic Programming
  49. 49. Experiment (cont.). 20 players, 3 plays each, 3 minutes per play. Average score obtained: 24, against the benchmark of 30.
  50. 50. Experiment (cont.). Player 7 (me), play 1: 1000 points ~ 1 row.
  51. 51. Experiment (cont.). • An average of 24 points every 3 minutes. • That is, 5,760 points over 12 hours of continuous play. • A human player only begins to approach the algorithms' performance, after some optimizations, after roughly 8 hours of continuous play.
  52. 52. Conclusion…
  53. 53. Dynamic Programming: optimizes the use of computational power. Reinforcement Learning: optimizes the weights used on the features. Tetris: uses a feature-based evaluation to maximize the score.
  54. 54. Questions? Suelen Goularte Carvalho Inteligência Artificial 2015
