1. ETeMoX: explaining reinforcement learning
J. M. Parra-Ullauri¹, A. García-Domínguez², N. Bencomo³, C. Zheng⁴, C. Zhen⁵, J. Boubeta-Puig⁶, G. Ortiz⁶, S. Yang⁷
1: Aston University, 2: University of York, 3: Durham University
4: University of Oxford, 5: University of Science and Technology of China
6: University of Cadiz, 7: Edinburgh Napier University
MODELS 2022 - Thursday, October 27th, 2022
2. Rising need for explanations in self-adaptation / AI
● Software systems are being written to deal with increasingly complex environments, where they need to reconfigure themselves and learn from experience
● If we are not careful, these systems become “black boxes” whose decisions we can only take at face value - it is hard to calibrate our trust in them
● We want to be able to ask things like:
○ Why did they take that action?
○ Why did they not take that *other* action?
○ How does the system (roughly) work?
● The “right to explanation” is being enshrined in the GDPR and in the IEEE P7001 standard for transparency of autonomous systems
● There is an entire research field on this: eXplainable AI (XAI)
3. Types of explanations and stages involved
● Explanations can be broadly classified by scope into:
○ Local - for a specific decision
○ Global - for the overall behaviour of the system (usually, a simplified behavioural model)
● Adadi et al. identified four uses for explanations:
○ To justify decisions impacting people
○ To control systems, keeping them within an envelope of good behaviour ⬅
○ To discover knowledge from the system’s behaviour
○ To improve the system by highlighting flaws ⬅
● Neerincx considered three stages for producing these explanations:
○ Generation - obtain the necessary data and reason about it ⬅
○ Communication - show the explanation to its consumer (human / system) ⬅
○ Reception - was the explanation effective and efficient?
4. How can MDE help XAI?
● In Model-Driven Engineering (MDE), we already have significant experience abstracting away unnecessary complexity
● At design time, we raise the level of abstraction so the developers of a system can think in terms of their domain concepts
● We can also do this while the system is running - we can build a model of what the system is perceiving, thinking, and doing (a runtime model)
● If we agree on a common trace metamodel for this, we can reuse explainability efforts across systems (see the sketch below)
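As a loose illustration of what such a trace records, here is a minimal Python sketch - the class and field names are made up for this sketch, not the actual metamodel (which is shown on the next two slides):

```python
# Hypothetical sketch of a decision trace: what the system perceived,
# which options it considered, and what it did. The real trace is an
# EMF runtime model, not Python classes.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Observation:
    values: Dict[str, float]      # what the system perceived

@dataclass
class Action:
    name: str                     # an option available to the system

@dataclass
class Decision:
    timeslice: int
    observation: Observation      # input to the decision
    options: List[Action]         # everything it could have done
    chosen: Action                # what it actually did

@dataclass
class Trace:
    decisions: List[Decision] = field(default_factory=list)

# “Why did you take that action?” becomes a lookup over the trace:
def decision_at(trace: Trace, timeslice: int) -> Decision:
    return next(d for d in trace.decisions if d.timeslice == timeslice)
```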
5. MDE: Reusable Trace Metamodel - common half
[Figure: common half of the reusable trace metamodel]
6. MDE: Reusable Trace Metamodel - specific half
● The first half of the metamodel is reusable across systems making their own decisions
● The second half is specific - this one is for systems using Q-Learning (a type of Reinforcement Learning)
● A Decision takes into account the Q-values of each Action
● Observations have rewards associated with them, and map to a state in the Q-table (see the sketch below)
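Continuing the earlier sketch, the Q-Learning-specific half could look roughly like this in Python - again hypothetical names, standing in for the metamodel classes:

```python
# Sketch of the Q-Learning-specific concepts: Q-values per candidate
# action, plus rewards and Q-table states on observations.
# Illustrative only - not the actual metamodel.
from dataclasses import dataclass
from typing import List

@dataclass
class QLObservation:
    reward: float                  # reward associated with the observation
    state_id: int                  # Q-table state it maps to

@dataclass
class QValuedAction:
    name: str
    q_value: float                 # Q(state, action) at decision time

@dataclass
class QLDecision:
    timeslice: int
    observation: QLObservation
    options: List[QValuedAction]   # Q-values of each candidate Action
    chosen: QValuedAction

# “Why this action?” can now be answered by comparing Q-values:
def why(decision: QLDecision) -> str:
    best = max(decision.options, key=lambda a: a.q_value)
    if decision.chosen.name == best.name:
        return f"{decision.chosen.name} had the highest Q-value ({best.q_value:.3f})"
    return f"exploration: picked {decision.chosen.name} over {best.name}"
```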
7. MDE: Indexing Models into Temporal Graph DBs
● At each system timeslice, the runtime model is indexed into a temporal graph
● This is an efficient representation of a graph’s full history, using copy-on-write state chunks (sketched below)
● Implemented by Greycat (from Hartmann et al.), and used by Eclipse Hawk for automated model indexing
● More details here: https://www.eclipse.org/hawk/advanced-use/temporal-queries/
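To make the copy-on-write idea concrete, here is a toy Python version (a simplification; Greycat’s state chunks are considerably more sophisticated): each node keeps a timeline of versions, so timeslices where a node did not change cost nothing.

```python
# Toy copy-on-write temporal store: one (time, state) timeline per
# node. Writes happen only when the node changes; reads resolve the
# latest version at or before the requested time.
import bisect
from typing import Any, Dict, List, Tuple

class TemporalNode:
    def __init__(self) -> None:
        self._timeline: List[Tuple[int, Dict[str, Any]]] = []

    def write(self, time: int, state: Dict[str, Any]) -> None:
        # Assumes time-ordered writes, as when indexing a live system.
        self._timeline.append((time, dict(state)))

    def read(self, time: int) -> Dict[str, Any]:
        times = [t for t, _ in self._timeline]
        i = bisect.bisect_right(times, time) - 1
        if i < 0:
            raise KeyError(f"node did not exist at t={time}")
        return self._timeline[i][1]

node = TemporalNode()
node.write(0, {"reward": 0.0})
node.write(5, {"reward": 1.5})       # nothing stored for t=1..4
assert node.read(3) == {"reward": 0.0}
assert node.read(7) == {"reward": 1.5}
```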
8. MDE: History-Aware Model Querying
● For explanation generation, we can query the temporal graph
● We created a Hawk-specific dialect of EOL with time-aware predicates and properties (summarised below)
● More details in our MODELS 2019 paper:
AGD, NB, JMPU and LGP, ‘Querying and annotating model histories with time-aware patterns’, http://dx.doi.org/10.1109/MODELS.2019.000-2
Version traversal:       x.versions, x.next, x.prev, x.time, x.earliest, x.latest…
Temporal assertions:     x.always(v | p), x.never(v | p), x.eventually(v | p)…
Predicate-based scoping: x.since(v | p), x.until(v | p)…
Context-based scoping:   x.sinceThen, x.untilThen…
Unscoping:               x.unscoped
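To pin down what these predicates compute, here is our reading of their semantics as a Python sketch (treating an element simply as the time-ordered list of its versions; the actual EOL dialect works on the Hawk graph):

```python
# Approximate semantics of the time-aware predicates, assuming an
# element is just its time-ordered list of versions. Illustrative
# only - not the Hawk implementation.
from typing import Callable, List, TypeVar

V = TypeVar("V")
Pred = Callable[[V], bool]

def always(versions: List[V], p: Pred) -> bool:      # x.always(v | p)
    return all(p(v) for v in versions)

def never(versions: List[V], p: Pred) -> bool:       # x.never(v | p)
    return not any(p(v) for v in versions)

def eventually(versions: List[V], p: Pred) -> bool:  # x.eventually(v | p)
    return any(p(v) for v in versions)

def since(versions: List[V], p: Pred) -> List[V]:    # x.since(v | p)
    # Scope to the versions from the first match onwards
    for i, v in enumerate(versions):
        if p(v):
            return versions[i:]
    return []

# e.g. “did the reward ever exceed 1.0?”
rewards = [0.2, 0.7, 1.3, 0.9]
assert eventually(rewards, lambda r: r > 1.0)
```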
9. Scaling up temporal graphs to large event volumes
● We first applied history-based explanations to Bayesian Learning-based systems
○ Partially Observable Markov Decision Processes (POMDPs)
○ Had a case study on data mirroring over the network (Remote Data Mirroring)
○ This wasn’t too resource-intensive (we could simply record all versions)
● Then we tried applying it to a Reinforcement Learning system
○ Tens of training epochs, each with thousands of episodes
○ The original RL system kept a per-timeslice MongoDB, with GBs of records to be indexed
○ We changed the RL system to send updates directly to Hawk - copy-on-write reduced storage needs
○ Still a lot of history to go through - queries could take a long time!
● Do we really need all this history?
○ Answer: no.
○ How do we select the “right” moments, without imposing too much load?
11. Event-driven Temporal Models for eXplanations (ETeMoX)
[Figure: ETeMoX architecture]
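A toy rendering of the pipeline the architecture implies - events from the RL system pass through a CEP stage, and only the events matching the active rules reach the temporal model. All names here are hypothetical; the actual pipeline uses Esper for CEP and Eclipse Hawk for indexing:

```python
# Toy event-driven pipeline: source -> CEP-style filter -> index.
# Only events some rule considers relevant reach the temporal graph,
# so it never has to absorb the full event firehose.
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]
Rule = Callable[[Event], bool]

def etemox_pipeline(events: Iterable[Event], rules: List[Rule]) -> List[Event]:
    indexed: List[Event] = []
    for event in events:
        if any(rule(event) for rule in rules):
            indexed.append(event)          # stand-in for “index into Hawk”
    return indexed

# e.g. keep one reward update in ten, plus every exploration step
rules: List[Rule] = [
    lambda e: e["type"] == "reward" and e["episode"] % 10 == 0,
    lambda e: e["type"] == "exploration",
]
events = [
    {"type": "reward", "episode": 3, "value": 0.4},
    {"type": "reward", "episode": 10, "value": 0.9},
    {"type": "exploration", "episode": 10, "action": "move_left"},
]
assert len(etemox_pipeline(events, rules)) == 2
```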
12. Case study: airborne base stations
[Figure: airborne base stations case study]
13. Experiment 1: Evolution of metrics (optionally sampled)
● We evaluated the impact of sampling at different rates on the accuracy of a query providing the historic reward values during RL training
● We set up the CEP engine with Esper EPL rules, as shown in the top right (paraphrased in Python below)
● Storage requirements decreased linearly with the sampling rate
● Sampling down to 10% of events was safe; beyond that, it depended on the RL algorithm (DQN is sensitive!)
[Plots: historic reward values at different sampling rates, for Q-Learning and DQN]
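A rough Python paraphrase of what the sampling does to the reward query (the real rules are Esper EPL, shown in the slide figure; keep_one_in and mean_abs_error are made up for this sketch):

```python
# Downsample the reward stream and measure how far the sampled curve
# drifts from the full one. keep_one_in=10 approximates 10% sampling.
from typing import List

def sample(rewards: List[float], keep_one_in: int) -> List[float]:
    return rewards[::keep_one_in]

def mean_abs_error(full: List[float], keep_one_in: int) -> float:
    # Compare each kept value against the stretch of curve it stands for
    kept = sample(full, keep_one_in)
    errs: List[float] = []
    for i, r in enumerate(kept):
        chunk = full[i * keep_one_in:(i + 1) * keep_one_in]
        errs.extend(abs(r - x) for x in chunk)
    return sum(errs) / len(errs)

# Storage shrinks linearly with keep_one_in; the accuracy loss depends
# on how noisy the reward signal is - hence DQN being more sensitive.
```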
14. Experiment 2: Exploration vs Exploitation (1/2)
● RL systems don’t always pick the best-known option (exploitation): sometimes they try other actions to learn more (exploration)
● How often does this happen?
● We compared two approaches to track this (sketched after this list):
a. A CEP pattern to detect exploration/exploitation, indexing only the episodes with exploration
b. An EOL query on the full history, to check the CEP pattern’s correctness
● Q-Learning explored 1.41% of the time, SARSA 7.99%, DQN 7.82%
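A hedged Python stand-in for approach (a) - the real detector is an Esper EPL pattern, and the event fields here are made up: a step counts as exploration when the chosen action is not the Q-greedy one.

```python
# A step is “exploration” when the agent did not pick the argmax-Q
# action (ties would need extra care in practice). We then index only
# the episodes that contain at least one exploration step.
from typing import Dict, List

def is_exploration(q_values: Dict[str, float], chosen: str) -> bool:
    greedy = max(q_values, key=q_values.get)
    return chosen != greedy

def episodes_to_index(episodes: List[List[dict]]) -> List[int]:
    return [
        i for i, steps in enumerate(episodes)
        if any(is_exploration(s["q_values"], s["chosen"]) for s in steps)
    ]
```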
15. Experiment 2: Exploration vs Exploitation (2/2)
● We tried using the exploration CEP rule as a filter for metric evolution, too
16. Experiment 3: user handovers between stations
● To provide continuous service, a user may be handed over to another station
● We wrote an EOL query to detect handovers in the system history (one plausible rendering is sketched below)
○ Handover: the signal-to-noise ratio changes significantly across stations between timepoints
○ Found 1784 handovers in Q-Learning, 590 in SARSA, 82176 in DQN
● These queries required many checks:
○ 10 episodes, 2000 time steps
○ 2 stations, 1050 users
○ All together: 42M combinations to check!
● Required times:
○ 917s for Q-Learning
○ 1,132s for SARSA
○ 7,914s for DQN
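The actual query is EOL over the Hawk graph; as a rough Python equivalent of the check it performs (the data layout and the “best station changed” reading of “changes significantly” are our assumptions):

```python
# For every user and every pair of consecutive timepoints, did the
# station with the strongest signal-to-noise ratio change?
# snr[t][user][station] -> SNR value (made-up layout).
from typing import Dict, List

def count_handovers(snr: List[Dict[int, Dict[int, float]]]) -> int:
    handovers = 0
    for t in range(1, len(snr)):
        for user, per_station in snr[t].items():
            prev = snr[t - 1].get(user)
            if prev is None:
                continue
            if max(per_station, key=per_station.get) != max(prev, key=prev.get):
                handovers += 1
    return handovers

snr = [
    {7: {0: 12.0, 1: 5.0}},   # t=0: user 7 best served by station 0
    {7: {0: 4.0, 1: 11.0}},   # t=1: station 1 now stronger -> handover
]
assert count_handovers(snr) == 1
# 10 episodes x 2000 steps x 2 stations x 1050 users gives the ~42M
# combinations mentioned above - hence the long query times.
```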
17. What’s next?
● Optimise queries via Hawk timeline annotations and CEP time windows
● Explanations for other uses:
○ Human-in-the-loop (our SAM 2022 presentation on Monday showed early work)
○ Hyper-parameter optimisation (external system requests explanations and drives change)
○ Global explanations of the system behaviour (event graphs)
● Studying explanation reception:
○ Effectiveness of explanation formats: plots, results, generated text, diagrams…
○ Focused on system developers so far: look into less technical audiences
○ Consider existing models for evaluation
■ Technology Acceptance Model (Davis)
■ XAI metrics (Rosenfeld)