

- 1. Dynamic Programming and Reinforcement Learning applied to the game of Tetris. Suelen Goularte Carvalho, Artificial Intelligence, 2015
- 2. Tetris
- 3. Tetris ✓ Board 20 x 10 ✓ 7 types of tetrominoes (pieces) ✓ Move down, left or right ✓ Rotate pieces
- 4. Tetris One-Piece Controller Player knows: ✓ board ✓ current piece.
- 5. Tetris Two-Piece Controller Player knows: ✓ board ✓ current piece ✓ next piece
- 6. Tetris Evaluation One-Piece Controller Two-Piece Controller
- 7. How many possibilities do we have just here?
- 8. Tetris indeed contains a huge number of board configurations (7.0 × 2^199 ≃ 5.6 × 10^59). Finding the strategy that maximizes the average score is an NP-Complete problem! — Building Controllers for Tetris, 2009
- 9. Complexity Tetris
- 10. Tetris is a problem of sequential decision making under uncertainty. In the context of dynamic programming and stochastic control, the most important object is the cost-to-go function, which evaluates the expected future cost from the current state. — Feature-Based Methods for Large Scale Dynamic Programming
- 11. Immediate reward vs. future reward. [Figure: tree of choices from state Si annotated with immediate rewards and future rewards; the best immediate reward and the best future reward are highlighted.]
- 12. 7.0 × 2^199 ≃ 5.6 × 10^59. It is essentially impossible to compute, or even store, the value of the cost-to-go function at every possible state. — Feature-Based Methods for Large Scale Dynamic Programming
- 13. Compact representations alleviate the computational time and space requirements of dynamic programming, which employs an exhaustive look-up table, storing one value per state. S = {s1, s2, …, sn}, V = {v1, v2, …, vm}, where m < n. — Feature-Based Methods for Large Scale Dynamic Programming
- 14. For example, if the state i represents the number of customers in a queueing system, a possible and often interesting feature f is defined by f(0) = 0 and f(i) = 1 if i > 0. Such a feature focuses on whether a queue is empty or not. — Feature-Based Methods for Large Scale Dynamic Programming
- 15. Feature-based method: S = {s1, s2, …, sn}, V = {v1, v2, …, vm}, where m < n. — Feature-Based Methods for Large Scale Dynamic Programming
- 16. Feature-based method for a 10 x 20 board. Features: ★ Height of the current wall. ★ Number of holes. H = {0, ..., 20}, L = {0, ..., 200}. Feature extraction F : S → H × L. — Feature-Based Methods for Large Scale Dynamic Programming
- 17. Using a feature-based evaluation function works better than just choosing the move that realizes the highest immediate reward. — Building Controllers for Tetris, 2009
- 18. Example of features — Building Controllers for Tetris, 2009
- 19. ...The problem of building a Tetris controller comes down to building a good evaluation function. Ideally, this function should return high values for the good decisions and low values for the bad ones. — Building Controllers for Tetris, 2009
- 20. In the Reinforcement Learning context, algorithms aim at tuning the weights such that the evaluation function approximates well the optimal expected future score from each state. — Building Controllers for Tetris, 2009
- 21. Reinforcement Learning
- 22. Reinforcement Learning by The Big Bang Theory https://www.youtube.com/watch?v=tV7Zp2B_mt8&list=PLAF3D35931B692F5C
- 23. Reinforcement Learning. Imagine playing a new game whose rules you do not know; after roughly a hundred moves, your opponent announces: "You lost!". In short, that is reinforcement learning.
- 24. Supervised Learning: map inputs to outputs. input 1 2 3 4 5 6 7 8 …; output 1 4 9 16 25 36 49 64 …; y = f(x) -> function approximation, here f(x) = x^2. Labels that score well. https://www.youtube.com/watch?v=Ki2iHgKxRBo&list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo
- 25. Unsupervised Learning: f(x) -> clusters description. [Figure: unlabeled points, drawn as x and o, grouped into clusters by type.] Clusters that score well.
- 26. Reinforcement Learning. [Figure: the agent sends an action to the environment; the environment returns a reward and a state.] Behaviors that score well.
- 27. Reinforcement Learning ✓ Agents take actions in an environment and receive rewards ✓ Goal is to find the policy π that maximizes rewards ✓ Inspired by research into psychology and animal learning
- 28. Reinforcement Learning Model. Given: S, a set of states; A, a set of actions; T(s, a, s') ~ P(s' | s, a), a transition model; R, a reward function. Find: π(s) = a, a policy that maximizes rewards. [Figure: tree from state Si with immediate rewards and future rewards.]
- 29. Requires more computation, processing, and memory.
- 30. Dynamic Programming
- 31. Dynamic Programming: solving a problem by breaking it down into simpler subproblems, solving each subproblem just once, and storing their solutions. https://en.wikipedia.org/wiki/Dynamic_programming
- 32. Supporting property: optimal substructure. [Figure: the optimal path from A to G passes through B; the segments from A to B and from B to G are themselves optimal paths.]
- 33. Fibonacci Sequence: 0 1 1 2 3 5 8 13 21. Each number is the sum of the two numbers before it.
- 34. Fibonacci Sequence. Recursive formula: f(n) = f(n-1) + f(n-2). n = 0 1 2 3 4 5 6 7 8; v = 0 1 1 2 3 5 8 13 21
- 35. Fibonacci. n = 0 1 2 3 4 5 6 7 8; v = 0 1 1 2 3 5 8 13 21. f(6) = f(6-1) + f(6-2); f(6) = f(5) + f(4); f(6) = 5 + 3; f(6) = 8
- 36. Fibonacci Sequence - Normal computation. f(n) = f(n-1) + f(n-2). [Figure: recursion tree for f(6).]
- 37. Fibonacci Sequence - Normal computation: O(2^n) running time. [Figure: recursion tree for f(6).]
- 38. 18 of 25 Nodes Are Repeated Calculations!
- 39. Fibonacci Sequence - Dynamic Programming. Dictionary m; m[0] = 0, m[1] = 1. integer fib(n): if m[n] == null, then m[n] = fib(n-1) + fib(n-2); return m[n]
- 40. Fibonacci Sequence - Dynamic Programming. [Figure: call tree for fib(5).] Table: index 0 1 2 3 4 5; value 0 1
- 41. Fibonacci Sequence - Dynamic Programming. [Figure: call tree for fib(5).] Table: index 0 1 2 3 4 5; value 0 1 1 (1+0=1)
- 42. Fibonacci Sequence - Dynamic Programming. [Figure: call tree for fib(5).] Table: index 0 1 2 3 4 5; value 0 1 1 2 (1+0=1, 1+1=2)
- 43. Fibonacci Sequence - Dynamic Programming. [Figure: call tree for fib(5).] Table: index 0 1 2 3 4 5; value 0 1 1 2 3 (1+0=1, 1+1=2, 2+1=3)
- 44. Fibonacci Sequence - Dynamic Programming. [Figure: call tree for fib(5).] Table: index 0 1 2 3 4 5; value 0 1 1 2 3 5 (1+0=1, 1+1=2, 2+1=3, 3+2=5). O(n) running time; O(n) memory for the table, or O(1) memory if only the last two values are kept.
- 45. Some scores over time… Tsitsiklis and van Roy (1996): 31 (100 games played). Bertsekas and Tsitsiklis (1996): 3,200 (100 games played). Kakade (2001): 6,800 (without specifying how many game scores are averaged, though). Farias and van Roy (2006): 4,700 (90 games played). — Building Controllers for Tetris, 2009
- 46. Current best! Dellacherie (Fahey, 2003), one-piece controller, tuned by hand: 660 thousand (56 games played). Dellacherie (Fahey, 2003), two-piece controller with some original features whose weights were tuned by hand: 7.2 million (only 1 game was played, and this took a week). — Building Controllers for Tetris, 2009
- 47. Experiment…
- 48. Experiment. An experienced human Tetris player would take about 3 minutes to eliminate 30 rows. — Feature-Based Methods for Large Scale Dynamic Programming
- 49. Experiment, cont. 20 players. 3 games each. 3 minutes per game. Average obtained: 24 score (against the reference of 30).
- 50. Experiment, cont. Player 7 (me), game 1. 1000 score ≈ 1 row.
- 51. Experiment, cont. • Average of 24 score every 3 minutes. • That is, 5,760 for every 12 h of continuous play. • A human player starts to approach the performance of the algorithms, after some optimizations, after roughly 8 h of continuous play.
- 52. Conclusion…
- 53. Dynamic Programming: optimizes the use of computational power. Reinforcement Learning: optimizes the weights used in the features. Tetris: uses a feature-based method to maximize the score.
- 54. Questions? Suelen Goularte Carvalho, Artificial Intelligence, 2015
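
The two features on slide 16 (height of the current wall, number of holes) and the weighted evaluation function described on slides 19 and 20 can be sketched as follows. This is only an illustration, not the method of either cited paper: the board encoding (20 rows of 10 cells, 1 = filled, row 0 at the top), the function names, the definition of a hole as an empty cell with a filled cell somewhere above it in the same column, and the weights are all assumptions.

```python
from typing import List, Tuple

Board = List[List[int]]  # assumed encoding: 20 rows x 10 columns, 1 = filled, row 0 is the top
ROWS, COLS = 20, 10


def wall_height(board: Board) -> int:
    """Height of the tallest column, in the range 0..20."""
    for r, row in enumerate(board):
        if any(row):
            return len(board) - r
    return 0


def count_holes(board: Board) -> int:
    """Empty cells with at least one filled cell above them in the same column."""
    holes = 0
    for c in range(len(board[0])):
        seen_filled = False
        for r in range(len(board)):
            if board[r][c]:
                seen_filled = True
            elif seen_filled:
                holes += 1
    return holes


def extract_features(board: Board) -> Tuple[int, int]:
    """F : S -> H x L, mapping a full board to the pair (height, holes)."""
    return wall_height(board), count_holes(board)


def evaluate(board: Board, weights=(-1.0, -4.0)) -> float:
    """Weighted sum of the features; these weights are made up for illustration."""
    height, holes = extract_features(board)
    return weights[0] * height + weights[1] * holes


def best_board(candidates: List[Board]) -> Board:
    """Pick the candidate successor board with the highest evaluation."""
    return max(candidates, key=evaluate)


empty = [[0] * COLS for _ in range(ROWS)]
flat = [row[:] for row in empty]
flat[19] = [1] * 9 + [0]        # nine filled cells on the bottom row, no holes
holey = [row[:] for row in empty]
holey[18] = [1] * 9 + [0]       # same shape one row higher, leaving nine holes below
print(extract_features(flat))   # (1, 0)
print(extract_features(holey))  # (2, 9)
```

With these illustrative weights, `best_board([holey, flat])` returns `flat`, because both a taller wall and more holes lower the evaluation. Tuning such weights is the part that slide 20 assigns to reinforcement learning.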
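
The model on slide 28 (states S, actions A, transition model T, reward function R, and a policy π) can be made concrete with a tiny example. The sketch below uses value iteration, a standard dynamic programming method that the slides do not name; the two-state problem, the discount factor `gamma`, and every identifier are assumptions chosen for illustration.

```python
# Hypothetical two-state problem; all numbers are illustrative.
S = ["low", "high"]
A = ["wait", "work"]

# T[(s, a)] lists (probability, next_state) pairs, i.e. P(s' | s, a).
T = {
    ("low", "wait"): [(1.0, "low")],
    ("low", "work"): [(0.8, "high"), (0.2, "low")],
    ("high", "wait"): [(1.0, "high")],
    ("high", "work"): [(1.0, "low")],
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("low", "wait"): 0.0,
    ("low", "work"): -1.0,
    ("high", "wait"): 2.0,
    ("high", "work"): 0.0,
}


def q_value(s, a, V, gamma):
    """Immediate reward plus the discounted expected value of the next state."""
    return R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])


def value_iteration(gamma=0.9, sweeps=200):
    """Repeat the Bellman optimality update, then read off a greedy policy."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        V = {s: max(q_value(s, a, V, gamma) for a in A) for s in S}
    policy = {s: max(A, key=lambda a: q_value(s, a, V, gamma)) for s in S}
    return V, policy


V, policy = value_iteration()
print(policy)  # {'low': 'work', 'high': 'wait'}
```

In the `low` state the chosen action has the worse immediate reward (-1 instead of 0) but the better future reward, which is the trade-off pictured on slide 11.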
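
The contrast between the plain recursion of slides 36 to 38 and the dictionary-based pseudocode of slide 39 can be checked directly. In this sketch the call counter and the function names are additions for illustration; `fib_memo` follows the slide's pseudocode, with the dictionary `m` seeded with m[0] = 0 and m[1] = 1.

```python
def fib_naive(n, counter):
    """Plain recursion; counter[0] records how many calls are made."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_memo(n, m=None):
    """Compute each f(k) once and store it in the dictionary m."""
    if m is None:
        m = {0: 0, 1: 1}
    if n not in m:
        m[n] = fib_memo(n - 1, m) + fib_memo(n - 2, m)
    return m[n]


calls = [0]
print(fib_naive(6, calls))  # 8
print(calls[0])             # 25 calls in total
print(calls[0] - 7)         # 18 repeats, since only f(0)..f(6) are distinct
print(fib_memo(6))          # 8
```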
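
The table filled in step by step on slides 40 to 44 corresponds to a bottom-up computation. The sketch below shows two variants, both written for illustration rather than taken from the slides: one keeps the whole index/value table, the other keeps only the last two values, which is what brings memory use down to O(1).

```python
def fib_table(n):
    """Bottom-up: fill the index/value table left to right (O(n) time, O(n) memory)."""
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[: n + 1]


def fib_two_vars(n):
    """Same recurrence keeping only the last two values (O(n) time, O(1) memory)."""
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev


print(fib_table(5))     # [0, 1, 1, 2, 3, 5]
print(fib_two_vars(5))  # 5
```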
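
The projection on slide 51 is plain arithmetic and can be reproduced as below. The sketch assumes, as the slide appears to, that a player keeps scoring at the measured average rate for the whole session; the constant and function names are hypothetical.

```python
AVERAGE_SCORE_PER_GAME = 24  # average reported on slide 49
MINUTES_PER_GAME = 3         # length of each game in the experiment


def projected_score(hours):
    """Score accumulated over `hours` of continuous play at a constant average rate."""
    games = hours * 60 // MINUTES_PER_GAME
    return games * AVERAGE_SCORE_PER_GAME


print(projected_score(12))  # 5760
print(projected_score(8))   # 3840
```

The 8-hour figure of 3,840 falls between the 3,200 and 4,700 averages listed on slide 45, which is consistent with the slide's remark about roughly 8 h of continuous play.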
