In my talk I will describe our current project at the Interaction Lab, in the School of Mathematical and Computer Sciences at Heriot-Watt University, Scotland. Our research focuses on developing voice-based interactive systems that can interact with people effectively and adaptively. Such systems often use Reinforcement Learning, a computational model that learns complex behaviours by trial and error. A drawback of such systems is limited scalability, i.e. difficulty in coping with large spaces of possibilities and with parallel tasks. I will describe three possible solutions to this problem: the use of prior knowledge, the reuse of learned policies, and flexible interaction. All three approaches will be illustrated with working systems that have been tested with real users. I will conclude by discussing possible directions for future work aimed at deploying Reinforcement Learning systems in real-world (non-experimental) settings.
2. Mary Ellen Foster, Simon Keizer, Zhuoran Wang, Oliver Lemon, Helen Hastie, Srini Janarthanam, Xingkun Liu, Verena Rieser, Dimitra Gkatzia, Nina Dethlefs, Arash Eshghi, Heriberto Cuayáhuitl, Ioannis Efstathiou, Wenshuo Tang, Katrin Lohan
4. Interactive Learning System/Robot
• Interactive learning machine: an entity which improves its performance through interacting with other machines, its physical world and/or humans.
(Cuayáhuitl, H., et al., 2013, IJCAI-MLIS)
6. Outline
1. Reinforcement Learning (RL)
2. Hierarchical RL
3. Applications
4. Related Work
5. Future Directions
6. Summary
[Figure: outline diagram centred on "Interactive Learning Systems"]
7. Outline: Where are we?
8. Interaction as a Markov Decision Process (MDP)
● The environment is described as an MDP:
  ● A set of states S;
  ● A set of actions A;
  ● A state transition function T;
  ● A reward function R.
● The MDP solution (policy or interaction manager) decides what to do using reinforcement learning
[Figure: state transition diagram, Pr(s2|s1,a1), with choice points]
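The slide defines the MDP tuple but leaves the solution criterion implicit. In the standard formulation (not shown on the slide, but consistent with the S, A, T, R and Pr(s2|s1,a1) notation above), the optimal action-value function satisfies the Bellman optimality equation:

```latex
Q^{*}(s,a) \;=\; R(s,a) \;+\; \gamma \sum_{s' \in S} \Pr(s' \mid s, a)\, \max_{a' \in A} Q^{*}(s', a'),
\qquad \gamma \in [0, 1)
```

where γ is the discount factor and T is represented by the transition probabilities Pr(s'|s,a).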
9. Reinforcement Learning is not Trivial
Known issues: scalability and partial observability
[Figure: state-space growth (10^0 up to 10^30 states, log scale) as a function of the number of binary variables — the state space grows exponentially]
10. The Goal of Reinforcement Learners
The goal is to find an optimal policy:
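The equation itself appeared as an image in the original deck; the standard statement of this objective, consistent with the MDP formulation on the previous slide, is:

```latex
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; \pi \right],
\qquad
\pi^{*}(s) = \arg\max_{a \in A} Q^{*}(s, a)
```

i.e. a policy maximizing the expected discounted cumulative reward, obtained greedily from the optimal action-value function.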
11. How to Represent the Agent's Policy?
● Tabular representations
● Tree-based representations
● Function approximation
  ● Linear
  ● Non-linear
12. Reinforcement Learning Algorithms
● Q-Learning
● Q-Learning with Linear Function Approximation
(Sutton & Barto, MIT Press, 1998; Szepesvári, Morgan & Claypool, 2010)
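As a concrete illustration of the first algorithm on this slide, here is a minimal tabular Q-learning sketch on a toy five-state corridor. The environment, hyperparameters, and all names are illustrative assumptions, not the systems described in this talk; the reward scheme (+100 at the goal, 0 otherwise) mirrors the interactive-taxi example that follows.

```python
import random

random.seed(0)

# Toy corridor: states 0..4, start at 0, goal at 4.
# Primitive actions: +1 (right), -1 (left); moves are clamped to the corridor.
N_STATES, GOAL = 5, 4
ACTIONS = [1, -1]
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def choose(s):
    """Epsilon-greedy action selection over the tabular Q-values."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for _ in range(200):                       # training episodes
    s = 0
    while s != GOAL:
        a = choose(s)
        s2 = min(N_STATES - 1, max(0, s + a))
        r = 100.0 if s2 == GOAL else 0.0
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])   # Q-learning update
        s = s2

greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(greedy)   # → [1, 1, 1, 1]: the learned policy steps right everywhere
```

Replacing the dictionary `Q` with a weight vector and features φ(s, a) (e.g. one-hot features, which reproduce the tabular case exactly) turns the same update into the linear-function-approximation variant named on the slide.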
13. Illustrative Example: The Interactive Taxi
• State transitions: 0.8 probability of correct navigation/recognition
• Reward: +100 for reaching the goal, 0 otherwise
• Size of state-action space: |S×A| = 50 × 5^4 × 3 × 4 × 16 = 6M state-actions
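The slide's count can be checked directly. The factors below are copied from the slide; their individual meanings (e.g. how many are grid cells vs. passenger locations) are not given there:

```python
# Product of the factors shown on the slide: 50 * 5^4 * 3 * 4 * 16
factors = [50, 5 ** 4, 3, 4, 16]
size = 1
for f in factors:
    size *= f
print(size)  # → 6000000, i.e. the slide's "6M" state-actions
```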
14. Outline: Where are we?
15. Hierarchical Reinforcement Learning
• Why? To learn system behaviours to carry out multiple tasks jointly (not separately)
[Figure: speech bubble — "I know how to do that, from playing the other game"]
16. Interaction as a Semi-Markov Decision Process (SMDP)
● The environment is described as an SMDP:
  ● S: set of states
  ● A: set of (complex) actions
  ● T: state transition function
  ● R: reward function
● One SMDP for each task or subtask
● Hierarchical reinforcement learning algorithms to solve SMDPs (e.g. HSMQ, MAXQ)
[Figure: task hierarchy — Task 1 … Task N, each decomposed into sub-tasks]
The goal is to find:
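The equation on this slide was an image in the original deck; consistent with the one-SMDP-per-subtask decomposition above, the standard hierarchical objective is a set of policies, one maximizing the expected discounted return of each subtask:

```latex
\pi^{*} = \{\pi_{0}^{*}, \pi_{1}^{*}, \ldots, \pi_{N}^{*}\},
\qquad
\pi_{i}^{*}(s) = \arg\max_{a \in A_{i}} Q_{i}^{*}(s, a)
```

where Q_i* is the optimal action-value function of subtask i, and A_i may contain both primitive and complex (child-subtask) actions.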
17. Conceptual SMDP for Interactive Systems
[Figure: conceptual SMDP hierarchy, annotated with its benefits — quicker learning, more scalability, behaviour reuse]
18. Hierarchical Reinforcement Learning Algorithms
● HSMQ-Learning
● HSMQ-Learning with Linear Function Approximation
● Other HRL algorithms: MAXQ, HAMQ
● Algorithms for structure learning: HEXQ, VISA, HI-MAT
(Barto & Mahadevan, 2003; Hengst, 2010)
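To make the HSMQ idea concrete, here is a minimal two-level sketch: a root task chooses between two subtasks on a toy line world, each subtask learns with ordinary Q-learning over primitive actions, and the root's SMDP update discounts the future by gamma^tau, where tau is how long the chosen subtask ran. Everything here (the line-world environment, the subtask names `GoMid`/`GoEnd`, the hyperparameters) is an illustrative assumption, not the HSMQ-Learning implementation from the cited papers.

```python
import random

random.seed(1)

GAMMA, ALPHA, EPS = 0.95, 0.3, 0.2
ACTIONS = [1, -1]                        # primitive actions: step right/left
SUBGOALS = {"GoMid": 5, "GoEnd": 9}      # each subtask terminates at its own goal

q_sub = {name: {} for name in SUBGOALS}  # one Q table per subtask
q_root = {}                              # root Q over (state, subtask) pairs

def q(table, s, a):
    return table.get((s, a), 0.0)

def greedy(table, s, options):
    return max(options, key=lambda a: q(table, s, a))

def run_subtask(name, s):
    """Run one subtask with epsilon-greedy Q-learning over primitive actions;
    return the final state, the cumulative discounted reward, and the duration."""
    goal, table = SUBGOALS[name], q_sub[name]
    total, discount, steps = 0.0, 1.0, 0
    while s != goal and steps < 50:
        a = random.choice(ACTIONS) if random.random() < EPS \
            else greedy(table, s, ACTIONS)
        s2 = min(9, max(0, s + a))
        r = 1.0 if s2 == goal else -0.01            # small per-step cost
        best = max(q(table, s2, b) for b in ACTIONS)
        table[(s, a)] = q(table, s, a) + ALPHA * (r + GAMMA * best - q(table, s, a))
        total += discount * r
        discount *= GAMMA
        s, steps = s2, steps + 1
    return s, total, steps

for _ in range(300):    # episodes on a line world 0..9, start at 0, goal at 9
    s = 0
    while s != 9:
        options = [t for t in SUBGOALS if SUBGOALS[t] != s]
        t = random.choice(options) if random.random() < EPS \
            else greedy(q_root, s, options)
        s2, r, tau = run_subtask(t, s)
        best = 0.0 if s2 == 9 else \
            max(q(q_root, s2, x) for x in SUBGOALS if SUBGOALS[x] != s2)
        # SMDP update: the continuation value is discounted by GAMMA**tau,
        # the number of primitive steps the subtask took
        q_root[(s, t)] = q(q_root, s, t) + \
            ALPHA * (r + GAMMA ** tau * best - q(q_root, s, t))
        s = s2

print(round(q(q_root, 0, "GoMid"), 2), round(q(q_root, 0, "GoEnd"), 2))
```

After training, the root's value for `GoMid` at state 0 should come out higher than for `GoEnd`: the mid-goal reward arrives after fewer discounted steps, and the episode then continues with `GoEnd` from state 5.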
19. Illustrative Example: The Interactive Taxi
• State transitions: 0.8 probability of correct navigation/recognition
• Reward: +100 for reaching the goal, 0 otherwise
• State-action space: |S×A| = 10.7K state-actions
20. Outline: Where are we?
22. Application 1: Travel Planning
● HRL without prior knowledge (HSMQ-Learning)
● HRL with prior knowledge (HAM+HSMQ-Learning), where W = joint state (SMDP+HAM)
● Training with simulated interactions
● Testing with real users
(Cuayáhuitl et al., Computer Speech & Language, 2010)
23. Travel Planning Spoken Dialogue System
(Cuayáhuitl et al., Computer Speech & Language, 2010)
24. Results in the Travel Planning Domain
• HRL finds solutions faster than flat learning
• HRL is more scalable than flat learning
• Learnt policies outperform hand-coded ones
(Cuayáhuitl et al., Computer Speech & Language, 2010)
25. Application 2: Indoor Wayfinding
● HRL without policy reuse (HSMQ-Learning)
● HRL with policy reuse (HSMQ_PR-Learning)
  ● Detect situations where the system knows how to act
  ● Action selection using an optimal policy (if reuse=true) or an exploratory policy (if reuse=false)
● Training with simulated interactions
● Testing with real users
(Cuayáhuitl & Dethlefs, ACM Trans. Speech & Lang. Proc., 2011)
27. Results in the Indoor Wayfinding Domain
• Policy reuse finds solutions faster than learning without it
• Adaptive route instructions are more efficient
(Cuayáhuitl & Dethlefs, ACM Trans. Speech & Lang. Proc., 2011)
28. Application 3: Human-Robot Interaction
● HSMQ vs. FlexHSMQ Learning with linear function approximation
● Training with simulated interactions
● Testing with real users
(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)
29. Robot Dialogue System (Quiz Game)
[Figure: system architecture, including the Interaction Manager]
(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)
30. Results in the Quiz Domain
• Non-strict HRL leads to more natural interactions
• Non-strict HRL is preferred by human users
(Cuayáhuitl et al., ACM Trans. Interactive Intelligent Sys., 2014)
31. Robot Asking and Answering Questions
(Belpaeme et al., 2012, Intl. Journal of HRI)
32. Outline: Where are we?
35. Spectrum of Markov Process Models
[Figure: spectrum of Markov process models, annotated "Promising for multi-task learning systems"]
(Mahadevan, S. et al., 2004, Handbook of Learning and Approximate Dynamic Programming)
36. Outline: Where are we?
37. Issues that Might Lead to Future Interactive Learning Systems
1. Big effort to make the system perform similar tasks
2. Simulations may not represent the real world
3. It is often hard to specify the reward function
4. The real world is partially known and dynamic
5. Poor spatial cognition will affect real-world impact
6. Small vocabularies discourage talking to machines
7. Lack of interactive learning systems in the real world
38. Towards Autonomous Interactive Systems and Robots
[Figure: degree of autonomy vs. number of tasks. Current interactive systems require a lot of human intervention; future interactive systems should be more autonomous. How do we get there? A holistic perspective for language, vision and robotics]
39. Outline: Where are we?
40. Summary
• Machines can be programmed to behave just as expected, but the physical world and humans demand systems that can learn
• Hierarchical learning plays an important role for multi-task interactive systems and robots
• More autonomy is needed if systems are to learn new skills with little human intervention
• A holistic, interdisciplinary perspective is needed for intelligent interactive robots
41. References
• Cuayáhuitl, H., Dethlefs, N., Kruijff-Korbayová, I. (2014). Non-Strict Hierarchical Reinforcement Learning for Interactive Systems and Robots. To appear in ACM Transactions on Interactive Intelligent Systems, vol. 4, no. 3.
• Cuayáhuitl, H., Dethlefs, N. (2011). Spatially-Aware Dialogue Control Using Hierarchical Reinforcement Learning. ACM Transactions on Speech and Language Processing, vol. 7, no. 3, pp. 5:1-5:26.
• Cuayáhuitl, H., Renals, S., Lemon, O., Shimodaira, H. (2010). Evaluation of a Hierarchical Reinforcement Learning Spoken Dialogue System. Computer Speech and Language, vol. 24, no. 2, pp. 395-429.
E-Mail: hc213@hw.ac.uk