Lucio Marcenaro – TUe Summer School

  1. 1. An introduction to cognitive robotics EMJD ICE Summer School - 2013 Lucio Marcenaro – University of Genova (ITALY)
  2. 2. Cognitive robotics? • Robots with intelligent behavior – Learn and reason – Complex goals – Complex world • Robots ideal vehicles for developing and testing cognitive: – Learning – Adaptation – Classification
  3. 3. Cognitive robotics • Traditional behavior modeling approaches problematic and untenable. • Perception, action and the notion of symbolic representation to be addressed in cognitive robotics. • Cognitive robotics views animal cognition as a starting point for the development of robotic information processing.
  4. 4. Cognitive robotics • “Immobile” Robots and Engineering Operations – Robust space probes, ubiquitous computing • Robots That Navigate – Hallway robots, Field robots, Underwater explorers, stunt air vehicles • Cooperating Robots – Cooperative Space/Air/Land/Underwater vehicles, distributed traffic networks, smart dust.
  5. 5. Some applications (1)
  6. 6. Some applications (2)
  7. 7. Other examples
  8. 8. Outline • Lego Mindstorms • Simple Line Follower • Advanced Line Follower • Learning to follow the line • Conclusions
  9. 9. The NXT Unit – an embedded system • 64K RAM, 256K Flash • 32-bit ARM7 microcontroller • 100 x 64 pixel LCD graphical display • Sound channel with 8-bit resolution • Bluetooth wireless communications • Stores multiple programs – Programs selectable using buttons
  10. 10. The NXT unit (Motor ports) (Sensor ports)
  11. 11. Motors and Sensors
  12. 12. • Built-in rotation sensors NXT Motors
  13. 13. NXT Rotation Sensor • Built in to motors • Measure degrees or rotations • Reads + and - • Degrees: accuracy +/- 1 • 1 rotation = 360 degrees
  14. 14. Viewing Sensors • Connect sensor • Turn on NXT • Choose “View” • Select sensor type • Select port
  15. 15. NXT Sound Sensor • Sound sensor can measure in dB and dBA – dB: in detecting standard [unadjusted] decibels, all sounds are measured with equal sensitivity. Thus, these sounds may include some that are too high or too low for the human ear to hear. – dBA: in detecting adjusted decibels, the sensitivity of the sensor is adapted to the sensitivity of the human ear. In other words, these are the sounds that your ears are able to hear. • Sound Sensor readings on the NXT are displayed in percent [%]. The lower the percent the quieter the sound. http://mindstorms.lego.com/Overview/Sound_Sensor.aspx
  16. 16. NXT Ultrasonic/Distance Sensor • Measures distance/proximity • Range: 0-255 cm • Precision: +/- 3cm • Can report in centimeters or inches http://mindstorms.lego.com/Overview/Ultrasonic_Sensor.aspx
  17. 17. NXT Non-standard sensors: HiTechnic.com • Compass • Gyroscope • Accelerometer/tilt sensor • Color sensor • IRSeeker • Prototype board with A/D converter for the I2C bus
  18. 18. LEGO Mindstorms for NXT (NXT-G) • NXT-G graphical programming language • Based on the LabVIEW programming language G • Program by drawing a flow chart
  19. 19. NXT-G PC program interface Toolbar Workspace Configuration Panel Help & Navigation Controller Palettes Tutorials Web Portal Sequence Beam
  20. 20. Issues of the standard firmware • Only one data type • Unreliable bluetooth communication • Limited multi-tasking • Complex motor control • Simplistic memory management • Not suitable for large programs • Not suitable for development of own tools or blocks
  21. 21. Other programming languages and environments – Java leJOS – Microsoft Robotics Studio – RobotC – NXC - Not eXactly C – NXT Logo – Lego NXT Open source firmware and software development kit
  22. 22. leJOS • A Java Virtual Machine for the NXT • Freely available – http://lejos.sourceforge.net/ • Replaces the NXT-G firmware • A leJOS plug-in is available for the free Eclipse development environment • Faster than NXT-G
  23. 23. Example leJOS Program UltrasonicSensor sonar = new UltrasonicSensor(SensorPort.S4); Motor.A.forward(); Motor.B.forward(); while (true) { if (sonar.getDistance() < 25) { /* obstacle ahead: spin */ Motor.A.forward(); Motor.B.backward(); } else { /* path clear: drive straight */ Motor.A.forward(); Motor.B.forward(); } }
  24. 24. Event-driven Control in leJOS • The Behavior interface – boolean takeControl() – void action() – void suppress() • Arbitrator class – Constructor gets an array of Behavior objects • takeControl() checked for highest index first – start() method begins event loop
  25. 25. Event-driven example class Go implements Behavior { private UltrasonicSensor sonar = new UltrasonicSensor(SensorPort.S4); public boolean takeControl() { return sonar.getDistance() > 25; }
  26. 26. Event-driven example public void action() { Motor.A.forward(); Motor.B.forward(); } public void suppress() { Motor.A.stop(); Motor.B.stop(); } }
  27. 27. Event-driven example class Spin implements Behavior { private UltrasonicSensor sonar = new UltrasonicSensor(SensorPort.S4); public boolean takeControl() { return sonar.getDistance() <= 25; }
  28. 28. Event-driven example public void action() { Motor.A.forward(); Motor.B.backward(); } public void suppress() { Motor.A.stop(); Motor.B.stop(); } }
  29. 29. Event-driven example public class FindFreespace { public static void main(String[] a) { Behavior[] b = new Behavior[] {new Go(), new Spin()}; Arbitrator arb = new Arbitrator(b); arb.start(); } }
  30. 30. Simple Line Follower • Use light-sensor as a switch • If measured value > threshold: ON state (white surface) • If measured value < threshold: OFF state (black surface)
  31. 31. Simple Line Follower • Robot not traveling inside the line but along the edge • Turning left until an “OFF” to “ON” transition is detected • Turning right until an “ON” to “OFF” transition is detected
  32. 32. Simple Line Follower NXTMotor rightM = new NXTMotor(MotorPort.A); NXTMotor leftM = new NXTMotor(MotorPort.C); ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED); while (!Button.ESCAPE.isDown()) { int currentColor = cs.getLightValue(); LCD.drawInt(currentColor, 5, 11, 3); if (currentColor < 30) { rightM.setPower(50); leftM.setPower(10); } else { rightM.setPower(10); leftM.setPower(50); } }
  33. 33. Simple Line Follower • DEMO
  34. 34. Advanced Line Follower • Use the light sensor as an analog sensor • Sensor readings range between 0 and 100 • The reading is the average light detected over a small area
  35. 35. Advanced Line Follower • Subtract the current sensor reading from the reading the sensor should produce – Use this value to directly control the direction and power of the wheels • Multiply this value by a constant that sets how strongly the wheels turn to correct the path • Add a base value so that the robot always keeps moving forward
  36. 36. Advanced Line Follower NXTMotor rightM = new NXTMotor(MotorPort.A); NXTMotor leftM = new NXTMotor(MotorPort.C); int targetValue = 30; int amplify = 7; int targetPower = 50; ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED); rightM.setPower(targetPower); leftM.setPower(targetPower); while (!Button.ESCAPE.isDown()) { int currentColor = cs.getLightValue(); int difference = currentColor - targetValue; int ampDiff = difference * amplify; int rightPower = ampDiff + targetPower; int leftPower = targetPower; rightM.setPower(rightPower); leftM.setPower(leftPower); }
  37. 37. Advanced Line Follower • DEMO
  38. 38. Learn how to follow • Goal – Make robots do what we want – Minimize/eliminate programming • Proposed Solution: Reinforcement Learning – Specify desired behavior using rewards – Express rewards in terms of sensor states – Use machine learning to induce desired actions • Target Platform – Lego Mindstorms NXT
  39. 39. Example: Grid World • A maze-like problem – The agent lives in a grid – Walls block the agent’s path • Noisy movement: actions do not always go as planned: – 80% of the time, preferred action is taken (if there is no wall there) – 10% of the time, North takes the agent West; 10% East – If there is a wall in the direction the agent would have been taken, the agent stays put • The agent receives rewards each time step – Small “living” reward each step (can be negative) – Big rewards come at the end (good or bad) • Goal: maximize sum of rewards
  40. 40. Markov Decision Processes • An MDP is defined by: – A set of states s ∈ S – A set of actions a ∈ A – A transition function T(s,a,s’) • Prob that a from s leads to s’ • i.e., P(s’ | s,a) • Also called the model (or dynamics) – A reward function R(s, a, s’) • Sometimes just R(s) or R(s’) – A start state – Maybe a terminal state • MDPs are non-deterministic search problems – Reinforcement learning: MDPs where we don’t know the transition or reward functions
  41. 41. What is Markov about MDPs? • “Markov” generally means that given the present state, the future and the past are independent • For Markov decision processes, “Markov” means: Andrej Andreevič Markov (1856-1922)
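For reference, the Markov property this slide refers to is the standard conditional-independence statement (reconstructed here, since the slide showed it only as an image):

```latex
P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t, S_{t-1}=s_{t-1}, \dots, S_0=s_0) = P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)
```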
  42. 42. Solving MDPs: policies • In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal • In an MDP, we want an optimal policy π*: S → A – A policy π gives an action for each state – An optimal policy maximizes expected utility if followed – An explicit policy defines a reflex agent • Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
  43. 43. Example Optimal Policies • [Four gridworld figures showing the optimal policy for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4 and R(s) = -2.0]
  44. 44. MDP Search Trees • Each MDP state gives an expectimax-like search tree – s is a state – (s, a) is a q-state – (s,a,s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
  45. 45. Utilities of Sequences • In order to formalize optimality of a policy, need to understand utilities of sequences of rewards • What preferences should an agent have over reward sequences? • More or less? – [1,2,2] or [2,3,4] • Now or later? – [1,0,0] or [0,0,1]
  46. 46. Discounting • It’s reasonable to maximize the sum of rewards • It’s also reasonable to prefer rewards now to rewards later • One solution: values of rewards decay exponentially
  47. 47. Discounting • Typically discount rewards by γ < 1 each time step – Sooner rewards have higher utility than later rewards – Also helps the algorithms converge • Example: discount of 0.5: – U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 – U([1,2,3]) < U([3,2,1])
  48. 48. Stationary Preferences • Theorem if we assume stationary preferences: • Then: there are only two ways to define utilities – Additive utility: – Discounted utility:
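The two utility forms mentioned on this slide are the standard ones; they are reconstructed below rather than copied from the deck’s images:

```latex
\text{Additive utility: } U([r_0, r_1, r_2, \dots]) = r_0 + r_1 + r_2 + \cdots
\text{Discounted utility: } U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \quad (0 < \gamma \le 1)
```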
  49. 49. Quiz: Discounting • Given: a row of states a, b, c, d, e – Actions: East, West and Exit (Exit available only in the end states a and e) – Exiting at a gives reward 10; exiting at e gives reward 1 – Transitions: deterministic • Quiz 1: For γ = 1, what is the optimal policy? • Quiz 2: For γ = 0.1, what is the optimal policy? • Quiz 3: For which γ are East and West equally good when in state d?
  50. 50. Infinite Utilities?! • Problem: infinite state sequences have infinite rewards • Solutions: – Finite horizon: • Terminate episodes after a fixed T steps (e.g. life) • Gives nonstationary policies (π depends on time left) – Discounting: for 0 < γ < 1 • Smaller γ means smaller “horizon” – shorter term focus • Absorbing state: guarantee that for every policy, a terminal state will eventually be reached
  51. 51. Recap: Defining MDPs • Markov decision processes: – States S – Start state s0 – Actions A – Transitions P(s’|s,a) (or T(s,a,s’)) – Rewards R(s,a,s’) (and discount γ) • MDP quantities so far: – Policy = Choice of action for each state – Utility (or return) = sum of discounted rewards
  52. 52. Optimal Quantities • Why? Optimal values define optimal policies! • Define the value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally • Define the value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally • Define the optimal policy: π*(s) = optimal action from state s
  53. 53. Gridworld V*(s) • Optimal value function V*(s)
  54. 54. Gridworld Q*(s,a) • Optimal Q function Q*(s,a)
  55. 55. Values of States • Fundamental operation: compute the value of a state – Expected utility under optimal action – Average sum of (discounted) rewards • Recursive definition of value
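The recursive definition referred to here is the standard Bellman optimality equation, written out below since the slide showed it only graphically:

```latex
V^*(s) = \max_a Q^*(s,a), \qquad Q^*(s,a) = \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma\, V^*(s')\right]
```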
  56. 56. Why Not Search Trees? • We’re doing way too much work with search trees • Problem: states are repeated – Idea: only compute needed quantities once • Problem: the tree goes on forever – Idea: do depth-limited computations, but with increasing depths until the change is small – Note: deep parts of the tree eventually don’t matter if γ < 1
  57. 57. Time-limited Values • Key idea: time-limited values • Define Vk(s) to be the optimal value of s if the game ends in k more time steps – Equivalently, it’s what a depth-k search tree would give from s
  58.–66. [Gridworld figures showing the time-limited values Vk(s) for k = 0, 1, 2, 3, 4, 5, 6, 7 and k = 100]
  67. 67. Value Iteration • Problems with the recursive computation: – Have to keep all the Vk*(s) around all the time – Don’t know which depth k(s) to ask for when planning • Solution: value iteration – Calculate values for all states, bottom-up – Keep increasing k until convergence
  68. 68. Value Iteration • Idea: – Start with V0*(s) = 0, which we know is right (why?) – Given Vi*, calculate the values for all states for depth i+1: – This is called a value update or Bellman update – Repeat until convergence • Complexity of each iteration: O(S²A) • Theorem: will converge to unique optimal values – Basic idea: approximations get refined towards optimal values – Policy may converge long before values do
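The value update mentioned above is the usual Bellman backup (reconstructed):

```latex
V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma\, V_i(s')\right]
```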
  69. 69. Practice: Computing Actions • Which action should we choose from state s: – Given optimal values V? – Given optimal q-values Q? – Lesson: actions are easier to select from Q’s!
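For reference, the standard answers: selecting an action from V* requires a one-step expectation, while selecting from Q* is a plain arg max:

```latex
\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma\, V^*(s')\right], \qquad \pi^*(s) = \arg\max_a Q^*(s,a)
```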
  70. 70. Utilities for Fixed Policies • Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy • Define the utility of a state s under a fixed policy π: Vπ(s) = expected total discounted rewards (return) starting in s and following π • Recursive relation (one-step look-ahead / Bellman equation):
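The recursive relation the slide points to is the Bellman equation for a fixed policy (standard form, reconstructed):

```latex
V^{\pi}(s) = \sum_{s'} T(s,\pi(s),s')\left[R(s,\pi(s),s') + \gamma\, V^{\pi}(s')\right]
```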
  71. 71. Policy Evaluation • How do we calculate the V’s for a fixed policy? • Idea one: modify Bellman updates • Efficiency: O(S²) per iteration • Idea two: without the maxes it’s just a linear system, solve with Matlab (or whatever)
  72. 72. Policy Iteration • Problem with value iteration: – Considering all actions each iteration is slow: takes |A| times longer than policy evaluation – But policy doesn’t change each iteration, time wasted • Alternative to value iteration: – Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast) – Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities (slow but infrequent) – Repeat steps until policy converges • This is policy iteration – It’s still optimal! – Can converge faster under some conditions
  73. 73. Policy Iteration • Policy evaluation: with fixed current policy , find values with simplified Bellman updates: – Iterate until values converge • Policy improvement: with fixed utilities, find the best action according to one-step look-ahead
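Written out, the two steps use the standard updates (reconstructed, not copied from the slide images):

```latex
\text{Evaluation: } V^{\pi}_{k+1}(s) \leftarrow \sum_{s'} T(s,\pi(s),s')\left[R(s,\pi(s),s') + \gamma\, V^{\pi}_{k}(s')\right]
\text{Improvement: } \pi_{\text{new}}(s) = \arg\max_a \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma\, V^{\pi}(s')\right]
```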
  74. 74. Comparison • In value iteration: – Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) • In policy iteration: – Several passes to update utilities with frozen policy – Occasional passes to update policies • Hybrid approaches (asynchronous policy iteration): – Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
  75. 75. Reinforcement Learning • Basic idea: – Receive feedback in the form of rewards – Agent’s utility is defined by the reward function – Must learn to act so as to maximize expected rewards – All learning is based on observed samples of outcomes
  76. 76. Reinforcement Learning • Reinforcement learning: – Still assume an MDP: • A set of states s ∈ S • A set of actions (per state) A • A model T(s,a,s’) • A reward function R(s,a,s’) – Still looking for a policy π(s) – New twist: don’t know T or R • I.e. don’t know which states are good or what the actions do • Must actually try actions and states out to learn
  77. 77. Model-Based Learning • Model-Based Idea: – Learn the model empirically through experience – Solve for values as if the learned model were correct • Step 1: Learn the empirical MDP model – Count outcomes for each (s,a) – Normalize to give an estimate of T(s,a,s’) – Discover R(s,a,s’) when we experience (s,a,s’) • Step 2: Solve the learned MDP – Iterative policy evaluation, for example
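Concretely, the empirical model is just normalized counts plus the observed rewards (standard estimates, stated here for completeness):

```latex
\hat{T}(s,a,s') = \frac{\text{count}(s,a,s')}{\text{count}(s,a)}, \qquad \hat{R}(s,a,s') = \text{reward observed on transition } (s,a,s')
```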
  78. 78. Example: Model-Based Learning (γ = 1) • Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100 (done) • Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100 (done) • Learned transitions: T(<3,3>, right, <4,3>) = 1 / 3, T(<2,3>, right, <3,3>) = 2 / 2
  79. 79. Model-Free Learning • Want to compute an expectation weighted by P(x): • Model-based: estimate P(x) from samples, compute expectation • Model-free: estimate expectation directly from samples • Why does this work? Because samples appear with the right frequencies!
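The expectation being approximated can be written as follows; the model-free estimator simply averages sampled values:

```latex
\mathbb{E}[f(x)] = \sum_x P(x)\, f(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim P
```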
  80. 80. Example: Direct Estimation (γ = 1, R = -1) • Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100 (done) • Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100 (done) • V(2,3) ≈ (96 + -103) / 2 = -3.5 • V(3,3) ≈ (99 + 97 + -102) / 3 = 31.3
  81. 81. Sample-Based Policy Evaluation? • Who needs T and R? Approximate the expectation with samples (drawn from T!) • Almost! But we only actually make progress when we move to i+1.
  82. 82. Temporal-Difference Learning • Big idea: learn from every experience! – Update V(s) each time we experience (s,a,s’,r) – Likely s’ will contribute updates more often • Temporal difference learning – Policy still fixed! – Move values toward the value of whatever successor occurs: running average! • Sample of V(s), update to V(s), and the same update rewritten incrementally:
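These three labelled quantities are the standard TD(0) policy-evaluation formulas (reconstructed):

```latex
\text{sample} = R(s,\pi(s),s') + \gamma\, V^{\pi}(s')
V^{\pi}(s) \leftarrow (1-\alpha)\, V^{\pi}(s) + \alpha\, \text{sample}
V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,\bigl(\text{sample} - V^{\pi}(s)\bigr)
```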
  83. 83. Exponential Moving Average • Exponential moving average – Makes recent samples more important – Forgets about the past (distant past values were wrong anyway) – Easy to compute from the running average • Decreasing learning rate can give converging averages
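The running average referred to here is the usual exponential moving average, with the learning rate α as the mixing weight:

```latex
\bar{x}_n = (1-\alpha)\, \bar{x}_{n-1} + \alpha\, x_n = \alpha \sum_{k=0}^{n-1} (1-\alpha)^k\, x_{n-k} + (1-\alpha)^n\, \bar{x}_0
```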
  84. 84. Example: TD Policy Evaluation • Take γ = 1, α = 0.5 • Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100 (done) • Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100 (done)
  85. 85. Problems with TD Value Learning • TD value learning is a model-free way to do policy evaluation • However, if we want to turn the values into a (new) policy, we’re sunk • Idea: learn Q-values directly • Makes action selection model-free too!
  86. 86. Active Learning • Full reinforcement learning – You don’t know the transitions T(s,a,s’) – You don’t know the rewards R(s,a,s’) – You can choose any actions you like – Goal: learn the optimal policy – … what value iteration did! • In this case: – Learner makes choices! – Fundamental tradeoff: exploration vs. exploitation – This is NOT offline planning! You actually take actions in the world and find out what happens…
  87. 87. Detour: Q-Value Iteration • Value iteration: find successive approximations of the optimal values – Start with V0*(s) = 0, which we know is right (why?) – Given Vi*, calculate the values for all states for depth i+1 • But Q-values are more useful! – Start with Q0*(s,a) = 0, which we know is right (why?) – Given Qi*, calculate the q-values for all q-states for depth i+1
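The two backup equations referenced here, in their standard form:

```latex
V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma\, V_i(s')\right]
Q_{i+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\right]
```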
  88. 88. Q-Learning • Q-Learning: sample-based Q-value iteration • Learn Q*(s,a) values – Receive a sample (s,a,s’,r) – Consider your old estimate: – Consider your new sample estimate: – Incorporate the new estimate into a running average:
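The sample and running-average update are the standard Q-learning rule, consistent with the formula given on slide 91:

```latex
\text{sample} = r + \gamma \max_{a'} Q(s',a')
Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]
```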
  89. 89. Q-Learning Properties • Amazing result: Q-learning converges to optimal policy – If you explore enough – If you make the learning rate small enough – … but not decrease it too quickly! – Basically doesn’t matter how you select actions (!) • Neat property: off-policy learning – learn optimal policy without following it (some caveats)
  90. 90. Q-Learning • Discrete sets of states and actions – States form an N-dimensional array • Unfolded into one dimension in practice – Individual actions selected on each time step • Q-values – 2D array (indexed by state and action) – Expected rewards for performing actions
  91. 91. Q-Learning • Table of expected rewards (“Q-values”) – Indexed by state and action • Algorithm steps – Calculate the state index from sensor values – Calculate the reward – Update the previous Q-value – Select and perform an action • Q(s,a) = (1 - α) Q(s,a) + α (r + γ maxa’ Q(s’,a’))
  92. 92. Q-Learning and Robots • Certain sensors provide continuous values – Sonar – Motor encoders • Q-Learning requires discrete inputs – Group continuous values into discrete “buckets” – [Mahadevan and Connell, 1992] • Q-Learning produces discrete actions – Forward – Back-left/Back-right
  93. 93. Creating Discrete Inputs • Basic approach – Discretize continuous values into sets – Combine each discretized tuple into a single index • Another approach – Self-Organizing Map – Induces a discretization of continuous values – [Touzet 1997] [Smith 2002]
  94. 94. Q-Learning Main Loop • Select action • Change motor speeds • Inspect sensor values – Calculate updated state – Calculate reward • Update Q values • Set “old state” to be the updated state
  95. 95. Calculating the State (Motors) • For each motor: – 100% power – 93.75% power – 87.5% power • Six motor states
  96. 96. Calculating the State (Sensors) • No disparity: STRAIGHT • Left/Right disparity – 1-5: LEFT_1, RIGHT_1 – 6-12: LEFT_2, RIGHT_2 – 13+: LEFT_3, RIGHT_3 • Seven total sensor states • 63 states overall
  97. 97. Calculating Reward • No disparity => highest value • Reward decreases with increasing disparity
  98. 98. Action Set for Line Follow • MAINTAIN – Both motors unchanged • UP_LEFT, UP_RIGHT – Accelerate motor by one motor state • DOWN_LEFT, DOWN_RIGHT – Decelerate motor by one motor state • Five total actions
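Putting slides 90–98 together, the following is a minimal, hypothetical leJOS-style sketch of such a Q-learning line follower; it is not the code used in the demo. The power levels, disparity thresholds, reward values, learning-rate/discount/exploration constants and the single-light-sensor disparity measure are all illustrative assumptions.

```java
// Hypothetical sketch, not the author's code. Reuses the leJOS classes seen earlier
// (NXTMotor, ColorSensor, Button); imports may need adjusting for your leJOS version.
import lejos.nxt.Button;
import lejos.nxt.ColorSensor;
import lejos.nxt.MotorPort;
import lejos.nxt.NXTMotor;
import lejos.nxt.SensorPort;
import lejos.robotics.Color;
import java.util.Random;

public class QLineFollower {
    static final int[] POWER = {30, 40, 50};      // assumed discrete power levels per motor
    static final int N_STATES = 3 * 3 * 7;        // left level x right level x sensor bucket
    static final int N_ACTIONS = 5;               // MAINTAIN, UP_LEFT, UP_RIGHT, DOWN_LEFT, DOWN_RIGHT
    static final double ALPHA = 0.3, GAMMA = 0.8, EPSILON = 0.1;   // assumed hyper-parameters

    public static void main(String[] args) {
        NXTMotor left = new NXTMotor(MotorPort.C);
        NXTMotor right = new NXTMotor(MotorPort.A);
        ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);
        double[][] q = new double[N_STATES][N_ACTIONS];
        Random rnd = new Random();

        int leftLevel = 2, rightLevel = 2;        // start at the highest (assumed) power level
        int state = stateIndex(leftLevel, rightLevel, sensorBucket(cs.getLightValue()));

        while (!Button.ESCAPE.isDown()) {
            // epsilon-greedy action selection
            int action = rnd.nextDouble() < EPSILON ? rnd.nextInt(N_ACTIONS) : argMax(q[state]);

            // apply the action: nudge one motor's power level by one step
            if (action == 1) leftLevel = Math.min(2, leftLevel + 1);    // UP_LEFT
            if (action == 2) rightLevel = Math.min(2, rightLevel + 1);  // UP_RIGHT
            if (action == 3) leftLevel = Math.max(0, leftLevel - 1);    // DOWN_LEFT
            if (action == 4) rightLevel = Math.max(0, rightLevel - 1);  // DOWN_RIGHT
            left.setPower(POWER[leftLevel]);
            right.setPower(POWER[rightLevel]);

            // observe the new state and reward (highest reward when disparity is zero)
            int bucket = sensorBucket(cs.getLightValue());
            int newState = stateIndex(leftLevel, rightLevel, bucket);
            double reward = 3 - Math.abs(bucket - 3);

            // Q-learning update: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
            q[state][action] = (1 - ALPHA) * q[state][action]
                    + ALPHA * (reward + GAMMA * q[newState][argMax(q[newState])]);
            state = newState;
        }
    }

    // Map the light reading to 7 buckets (0..6); bucket 3 means no disparity (assumed thresholds).
    static int sensorBucket(int light) {
        int disparity = light - 30;               // 30 = assumed target reading on the line edge
        if (disparity < -13) return 0;
        if (disparity < -6) return 1;
        if (disparity < 0) return 2;
        if (disparity == 0) return 3;
        if (disparity <= 5) return 4;
        if (disparity <= 12) return 5;
        return 6;
    }

    static int stateIndex(int leftLevel, int rightLevel, int bucket) {
        return (leftLevel * 3 + rightLevel) * 7 + bucket;
    }

    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) if (values[i] > values[best]) best = i;
        return best;
    }
}
```

In this sketch the state index combines the two (assumed) motor power levels with the seven sensor buckets, giving 9 × 7 = 63 states as on slide 96.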
  99. 99. Q-learning line follower
  100. 100. Conclusions • Lego Mindstorms NXT as a convenient platform for «cognitive robotics» • Executing a task with «rules» • Learning how to execute a task – MDP – Reinforcement learning • Q-learning applied to Lego Mindstorms
  101. 101. Thank you! • Questions?
