Reinforcement Learning and Application to Path-Planning Problems
Supervisor: Cristina Baroglio · Candidate: Luca Marignati
12/07/2019
Bachelor's Thesis in Computer Science
Torino
Outline
• Context
• RL problem
• TD method
• Q-Learning and Sarsa
• Software
• Tests
• Conclusions
• Future developments
Context: Machine Learning Paradigms
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Actors
• Agent
• Environment
Other elements
• Policy π: S → A
  • Find the optimal policy π*
• Reward function R: (S, A) → reward
• Value function
• Model: optional (model-free approach)
Temporal difference method
• Guided by two time instants: instant t and instant t + 1
• Model-free: learns directly from experience
• Bootstrapping: step-by-step incremental approach
• Off-policy/on-policy methods: Q-Learning/Sarsa
Algorithms: Q-Learning and Sarsa
Based on Q(s,a)
• Similar to V(s), but focused on the state-action pair
• Value of a state's utility → quality value
• Describes the gain or loss obtained by performing action a in state s
• Total long-term reward (environment knowledge)
• Bellman equation: Q*(s, a) = Σ_{s'} P(s' | s, a) · [R(s, a, s') + γ · max_{a'} Q*(s', a')]
Q-Learning: off-policy

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize St
    Repeat (for each step of episode):
        Choose at from St using policy derived from Q (e.g., ε-greedy)
        Take action at, observe R, St+1
        Update Q-value: Q(St, at) ← Q(St, at) + α · [R + γ · max_a Q(St+1, a) − Q(St, at)]
        St = St+1
    until St is terminal
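As a minimal JavaScript sketch of this update (JavaScript being the language of the thesis software): the names qTable, alpha, gamma and ACTIONS below are illustrative assumptions, not the actual thesis code.

```javascript
// Minimal Q-Learning update sketch; all names here are illustrative.
const ACTIONS = ['up', 'down', 'right', 'left'];
const alpha = 0.1; // learning rate
const gamma = 0.9; // discount factor

// qTable maps a state key to its action values, e.g.
// qTable['3030'] = { up: 0, down: 0, right: 0, left: 0 }
function qLearningUpdate(qTable, s, a, reward, sNext) {
  // Off-policy target: greedy value of the next state, regardless of the
  // action the behaviour policy will actually take there. A terminal
  // state (absent from the table) contributes 0.
  const maxNext = qTable[sNext]
    ? Math.max(...ACTIONS.map(an => qTable[sNext][an]))
    : 0;
  qTable[s][a] += alpha * (reward + gamma * maxNext - qTable[s][a]);
}
```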
Sarsa: on-policy

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize St
    Choose at from St using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action at, observe R, St+1
        Choose at+1 from St+1 using policy derived from Q (e.g., ε-greedy)
        Update Q-value: Q(St, at) ← Q(St, at) + α · [R + γ · Q(St+1, at+1) − Q(St, at)]
        St = St+1; at = at+1
    until St is terminal

Similar structure to Q-Learning; only the update rule changes.
Different approaches to the value-update rule
(1) Off-policy feature (Q-Learning)
• Action at → chosen by the current policy (e.g., ε-greedy policy)
• Action at+1 → chosen by the greedy policy, starting from the state St+1
(2) On-policy feature (Sarsa)
• Action at → chosen by the current policy (e.g., ε-greedy policy)
• Action at+1 → chosen by the current policy (e.g., ε-greedy policy)
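Under the same illustrative assumptions as the Q-Learning sketch above, a Sarsa sketch changes only the target of the update, using the action at+1 actually selected by the current policy:

```javascript
// Sarsa update sketch, reusing the illustrative qTable/alpha/gamma names
// from the Q-Learning sketch above.
function sarsaUpdate(qTable, s, a, reward, sNext, aNext) {
  // On-policy target: value of the (St+1, at+1) pair the agent will follow
  // under the current (e.g. ε-greedy) policy; terminal states contribute 0.
  const nextQ = qTable[sNext] ? qTable[sNext][aNext] : 0;
  qTable[s][a] += alpha * (reward + gamma * nextQ - qTable[s][a]);
}
```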
Practical problem: Path Planning
Tools
• Languages
  • JavaScript/jQuery
  • HTML5 (Canvas and APIs)
• Libraries
  • Bootstrap → responsive layout
  • Chart.js → algorithm performance charts
  • FontAwesome → icon management
Problem description
1. Single-agent system
2. Variants of the environment (grid 12x4/10x10)
3. Finite states and actions
  • Finite states (48/100)
  • Limited set of actions → {up, down, right, left}
4. Target → reach the goal state
5. Episodic task
6. Reward function (sketched in code below)
  • −1 → non-terminal states (neutral states)
  • −100 → defeat states (The Cliff)
  • +100 → goal state
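A short JavaScript sketch of such a reward function; the string state keys and the goalState/deathStates argument names are hypothetical:

```javascript
// Illustrative reward function matching the list above; the state-key
// representation and argument names are assumptions.
function reward(stateKey, goalState, deathStates) {
  if (stateKey === goalState) return 100;     // goal state
  if (deathStates.has(stateKey)) return -100; // defeat state (The Cliff)
  return -1;                                  // neutral, non-terminal state
}

// Example: reward('690210', '690210', new Set(['150270'])) === 100
```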
DEMO
• Web → https://www.reinforcementlearning.it
• Local → http://rl/
Section 1: CONFIGURATION
1) Set parameters
2) Choose algorithm
3) Number of victories
4) Number of defeats
Section 2: VISUALIZATION OF THE ENVIRONMENT
BUTTONS
1) Start/stop/accelerate learning
2) Set a Goal State
3) Set a Defeat State
4) Modal for choosing positions
Section 3: INFORMATION ON RESULTS
1) Average reward
2) Average moves
PERFORMANCE OF THE ALGORITHM
1) Chart.js
2) Verification of learning
3) Convergence to the optimal path
Q-VALUES FOR STATES
Environment configuration
Object representation: key-value structure (sketched below)

Key            Value
startstate     x: 30,  y: 210
goalstate      x: 690, y: 210  ← terminal state
deathstate_1   x: 150, y: 270
deathstate_2   x: 210, y: 210
deathstate_3   x: 270, y: 420
deathstate_4   x: 330, y: 210
deathstate_5   x: 390, y: 530
…              …
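In JavaScript such a key-value structure could be a plain object literal; a sketch following the table above (the literal itself is an assumption, not the actual configuration file):

```javascript
// Sketch of the key-value environment configuration; the keys follow the
// table above, the object-literal form is an assumption.
const environment = {
  startstate:   { x: 30,  y: 210 },
  goalstate:    { x: 690, y: 210 }, // terminal state
  deathstate_1: { x: 150, y: 270 },
  deathstate_2: { x: 210, y: 210 },
  deathstate_3: { x: 270, y: 420 },
  // ... remaining defeat states as in the table
};
```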
Implementation choices (1)
• Tabular description
  • Limited state space
  • Cells: Q(s,a) values (initialized to 0 → no knowledge)
  • Key-value structure (see the sketch after the table)

Pos.     Up  Down  Right  Left
3030     -   0     0      -
3090     0   0     0      -
9030     -   0     0      0
9090     0   0     0      0
15030    -   0     0      0
15090    0   0     0      0
…        …   …     …      …
630150   0   0     0      0
630210   0   -     0      0
690150   0   0     -      0
690210   0   -     -      0
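A possible initialization of this table in JavaScript (illustrative names; the '-' border cells are not modelled in this sketch):

```javascript
// Builds an empty Q-table: one row per grid cell, keyed by the concatenated
// x/y position (e.g. '3030'), all action values starting at 0 (no knowledge).
function initQTable(positionKeys) {
  const qTable = {};
  for (const pos of positionKeys) {
    qTable[pos] = { up: 0, down: 0, right: 0, left: 0 };
  }
  return qTable;
}

// Example: initQTable(['3030', '3090', '9030'])
```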
Implementation choices (2)
How are actions chosen?
• ε-greedy policy (e.g., ε = 0.1)
• A compromise between exploration and exploitation (sketch below)
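A minimal ε-greedy sketch in JavaScript, with the same illustrative Q-table shape as in the sketches above:

```javascript
// ε-greedy action selection (ε = 0.1 as in the slide): with probability ε
// explore a uniformly random action, otherwise exploit the greedy one.
const ACTIONS = ['up', 'down', 'right', 'left'];

function epsilonGreedy(qTable, stateKey, epsilon = 0.1) {
  if (Math.random() < epsilon) {
    return ACTIONS[Math.floor(Math.random() * ACTIONS.length)]; // explore
  }
  // exploit: the action with the highest Q-value in this state
  return ACTIONS.reduce((best, a) =>
    qTable[stateKey][a] > qTable[stateKey][best] ? a : best);
}
```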
Tests
• #Test1: Grid 12x4
• #Test2: Grid 10x10 – simple environment
• #Test3: Grid 10x10 – complex environment
• #Test4: Grid 10x10 – dynamic environment
Common features: every test goes through the same four steps, input environments → algorithm choice → trial-and-error learning → convergence to the optimal path.

Step 1: Input environments
• Different grid configurations (12x4, 10x10)
• Different degrees of difficulty

Step 2: Choice of algorithm

Step 3: Trial-and-error learning

Step 4: Convergence to the optimal path
Conclusions
1. Can an agent learn without examples of correct behavior? → Difference from Supervised Learning
2. Study of Reinforcement Learning methods and of the basic principles that characterize them (notions of agent, environment, MDP, ...)
3. Focus on TD methods (Sarsa and Q-Learning)
4. Analysis of a practical problem: path planning
5. JavaScript software → the agent adapts to any environment provided as input in order to reach the set objective
6. Different nature of the Sarsa and Q-Learning algorithms
Conclusions: Sarsa vs Q-Learning

Sarsa                                    Q-Learning
Safe path                                Fast path
Prudent policy                           Risky attitude
Not suitable for complex environments    Suitable for any type of environment
Optimizes the agent's performance        Trains agents in simulated environments
Expensive mistakes → keep the risk away  Errors do not involve large losses

Model-free → expensive adaptation to changes (TD property)
Future developments
• Real-world RL problems
• Partially Observable Markov Decision Problems (POMDP)
• Model-based algorithms
• Better learning policies (e.g., softmax)
• Replace the Q-table with artificial neural networks
  (e.g., chess → state space ≈ 10^120)
• Continuous tasks (not episodic)
• Multi-agent systems (opponent agent)
Questions?
Supervisor: Cristina Baroglio · Candidate: Luca Marignati
12/07/2019
Thank you for your attention!
Torino