1. Particle Filter on Episode for Learning Decision Making Rule
Ryuichi Ueda Chiba Inst. of Technology
Kotaro Mizuta RIKEN BSI
Hiroshi Yamakawa DOWANGO
Hiroyuki Okada Tamagawa Univ.
2. navigation problems in the real world
• Not only robots but also animals solve them.
• Mammals have specialized cells for spatial recognition in their brains.
– especially around the hippocampus
– e.g. place cells
• They show different reactions at different places in the environment.
• -> existence of maps in the brain
July 6th, 2016 IAS-14 Shanghai 2
Place cells [O'Keefe71] (http://en.wikipedia.org/wiki/Place_cell)
3. map vs. memory
• Mammals have maps in their brains.
• Maps of environments are also of concern in robotics.
– SLAM has been one of the most important topics.
– studies introducing functions of the hippocampus
• RatSLAM [Milford08]
• How about memory?
– Memory is also handled in the hippocampus.
– Sequences of memory are reduced to maps (or state space models).
– Robots can record their memory for a long time if they have TB-level storage. (a difference between mammals and robots)
4. the purpose
• our intuition
– If memory is the source of maps, a robot should be able to decide its actions not from a map but directly from memory.
– Knowledge about how memory is handled in the hippocampus and its surroundings will help this attempt.
• to implement a learning algorithm that directly utilizes memory
– particle filter on episode (PFoE)
– validation with an actual robot
5. related works
• Episode-based reinforcement learning [Unemi 1999]
– Its basic idea is identical to that of PFoE.
– PFoE simplifies the implementation and enables real-time calculation.
• RatSLAM [Milford08]
– an algorithm for robotics utilizing the knowledge around the
hippocampus
6. outline of PFoE
• While repeating a task for learning, the robot stores events.
– an event = the set of sensor readings, the action, and the reward (given by someone) obtained at a discrete time step
– an episode = the sequence of events
• The degree of recall of each event is represented as a
probability.
(figure: the episode as a sequence of states s and actions a along the time axis, with rewards +1 and -1, and the belief over past events up to the present time)
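The event/episode structure above can be sketched as a simple data type. This is a hypothetical sketch; the field names and types are assumptions, not taken from the slides:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Event:
    """One discrete time step of the episode."""
    sensors: Tuple[float, ...]  # sensor readings at this step
    action: str                 # the action taken at this step
    reward: float               # the reward given by someone (often 0)

# The episode is simply the time-ordered sequence of events.
episode: List[Event] = []
episode.append(Event(sensors=(0.3, 0.5, 0.2, 0.4), action="right", reward=1.0))
```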
7. decision with the belief and the episode
• An action is chosen by calculating expectation values of the reward over recalled events.
(figure: the belief over the episode at decision time; annotations: "When the robot recalls these events, it may obtain the +1 reward if it chooses the same action as at those times." / "When the robot recalls these events, it should change its action to avoid the -1 reward.")
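One way to realize this decision rule is to weight the reward that followed each recalled event by the belief and pick the action with the highest expected value. A minimal sketch, assuming events are (sensors, action, reward) tuples and the belief is a list of (episode index, weight) pairs; these names are illustrative, not from the paper:

```python
from collections import namedtuple

Event = namedtuple("Event", "sensors action reward")

def choose_action(episode, belief, actions=("left", "right")):
    """Pick the action with the highest expected reward under the belief.

    For each recalled event at index t, the next event (t+1) tells which
    action was taken there and what reward followed it."""
    expected = {a: 0.0 for a in actions}
    for t, w in belief:
        if t + 1 < len(episode):
            nxt = episode[t + 1]
            expected[nxt.action] += w * nxt.reward
    return max(expected, key=expected.get)

episode = [Event((0.1,), "left", 0.0), Event((0.2,), "right", 1.0),
           Event((0.3,), "left", -1.0)]
belief = [(0, 0.6), (1, 0.4)]
print(choose_action(episode, belief))  # -> right
```

Here recalling index 0 suggests "right" led to +1, while recalling index 1 suggests "left" led to -1, so "right" wins.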
8. representation with particles
• The belief is represented with particles.
– computation stays O(N) in the number of particles even if the episode grows without limit
• variables of a particle
– its position on the time axis
– its weight
(figure: particles on the time axis representing the belief up to the present time)
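Each particle carries just the two variables listed above. A minimal sketch (the class and field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Particle:
    t: int         # position on the time axis (index into the episode)
    weight: float  # importance weight

# N particles approximate the belief; per-step cost is O(N),
# independent of how long the episode becomes.
N = 1000
particles = [Particle(t=0, weight=1.0 / N) for _ in range(N)]
```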
9. operation of PFoE – motion update
• When the current time advances to the next time step, each particle simply shifts one step forward.
– The episode is extended by an additional event.
– Positions of particles are shifted.
(figure: before the action, particles lie on the time axis; after the action, the new event is appended to the episode and each particle shifts one step forward)
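The motion update above amounts to an append plus an index shift. A sketch assuming particles are small dicts with position `t` and weight `w` (illustrative names):

```python
def motion_update(particles, episode, new_event):
    """PFoE motion update: extend the episode by the new event and
    shift each particle to its next time step."""
    episode.append(new_event)
    for p in particles:
        p["t"] += 1  # each particle moves one step forward on the time axis
    return particles

episode = ["e0", "e1", "e2"]
particles = [{"t": 0, "w": 0.5}, {"t": 1, "w": 0.5}]
motion_update(particles, episode, "e3")
print([p["t"] for p in particles])  # -> [1, 2]
```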
10. operation of PFoE – sensor update
• The event related to each particle is compared to the latest one.
– Weights are reduced according to the difference.
• Particles are resampled and weights normalized after the reduction.
• When the sum of weights before normalization is under
a threshold, all particles are replaced (a reset).
– how to reset?
(figure: each particle's event e is compared to the latest event; differences in sensor readings, the reward, or the action reduce the weight)
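The sensor update can be sketched as follows. Events are reduced to scalars here for brevity, and the likelihood `1/(1+diff)` is an assumed model, not the one in the paper:

```python
import random

def sensor_update(particles, episode, latest, threshold=0.2):
    """PFoE sensor update (sketch): reduce each particle's weight by the
    difference between its event and the latest event, then resample.
    Returns None when the total weight falls below the threshold,
    signalling that a reset is needed."""
    for p in particles:
        diff = abs(episode[p["t"]] - latest)
        p["w"] *= 1.0 / (1.0 + diff)      # assumed likelihood model
    total = sum(p["w"] for p in particles)
    if total < threshold:
        return None                        # caller performs a (retrospective) reset
    # normalize and resample with replacement
    n = len(particles)
    weights = [p["w"] / total for p in particles]
    positions = random.choices([p["t"] for p in particles], weights=weights, k=n)
    return [{"t": t, "w": 1.0 / n} for t in positions]
```

A larger threshold makes the reset fire more often, which matters in the experiments below.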
11. operation of PFoE – retrospective resets
• inspired by the retrospective activity of place cells
– When a rat recalls past events, its place cells become active as if the rat were virtually moving.
• algorithm
– 1. place particles randomly
– 2. replay the motion update and the sensor update for
M steps with the past M events from the current time
(figure: particles placed M steps before the current time are moved and compared through the last M events e)
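The two steps above can be sketched as one function. As before, events are scalars and the likelihood `1/(1+diff)` is assumed; this is an illustrative sketch, not the authors' implementation:

```python
import random

def retrospective_reset(episode, n, M):
    """Retrospective reset (sketch): scatter particles at random positions,
    then replay the motion and sensor updates over the last M events."""
    start = len(episode) - M
    # 1. place particles randomly before the replayed window
    particles = [{"t": random.randrange(start), "w": 1.0 / n} for _ in range(n)]
    # 2. replay M steps with the past M events from the current time
    for step in range(start, len(episode)):
        for p in particles:
            p["t"] += 1                                  # motion update
            diff = abs(episode[p["t"]] - episode[step])  # sensor update
            p["w"] *= 1.0 / (1.0 + diff)                 # assumed likelihood
    total = sum(p["w"] for p in particles)
    for p in particles:
        p["w"] /= total
    return particles
```

After the replay, particles concentrate on past stretches of the episode that resemble the most recent M events.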
12. experiments
• the robot: a micromouse with 4 range sensors
• a T-maze with a reward at one of its arms
• The robot chooses a turn-right or a turn-left action at the T-junction.
• State transition is simplified to cycles of 4 events.
– The robot records an event when
• it is placed on the initial position
• it reaches the T-junction
• it turns right or left
• it reaches the end of an arm
(figure: the micromouse with the directions of its sensors, and the T-maze with a marker indicating the reward)
13. tasks of experiments
• a periodical task
– The reward is placed on the right or the left arm alternately.
– cycles of 8 events
• a discrimination task
– The reward is placed on the side where the robot is initially placed.
– Right or left is chosen randomly.
• not periodical
• 1000 particles
• 50 trials in an episode x 5 sets
14. periodical task with/without the retro. reset
• Retrospective resets reallocate particles effectively.
(figure: learning curves with random resets vs. with retrospective resets)
15. discrimination task
• comparison of thresholds for retro. resets
• A higher threshold produces signs of learning.
– Particles are replaced frequently and can jump across the cyclic state transition.
– But the learning is not perfect.
(figure: results with reset thresholds 0.2 (infrequent resets) and 0.5 (frequent resets))
16. conclusion
• Particle Filter on Episode (PFoE)
– estimates the relation between the present and the past,
– is capable of real-time learning, and
– does not require an environmental model except for the observation model used in the sensor update.
• experimental results
– It works on the actual robot.
– The simple periodical task can be learned within 20 trials.
– The discrimination task can be partially learned (75% success).
• The idea of retrospective resetting should be extended to non-periodical tasks. (future work)
17. periodical task again
with different threshold
• to check for ill effects of the high threshold for retrospective resets in the periodical task
• result: no ill effects are observed
(figure: results with thresholds 0.2 and 0.5)