Explainable Online Reinforcement Learning for Adaptive Systems

© SSE, Prof. Dr. Andreas Metzger
Explainable Online Reinforcement Learning for
Adaptive Systems
Felix Feit, Andreas Metzger, Klaus Pohl
(with contributions from Jan Laufer)
SE 2023, Paderborn
F. Feit, A. Metzger, K. Pohl, “Explainable Online Reinforcement Learning for
Adaptive Systems”, in Software Engineering 2023, Fachtagung des GI-
Fachbereichs Softwaretechnik, 21.-24. Februar 2023, Paderborn, ser. LNI, G.
Engels, R. Hebig, M. Tichy, Eds., vol. P-332. Gesellschaft für Informatik e.V.,
2023, pp. 53–54.

©
SSE,
Prof.
Dr.
Andreas
Metzger
SOFTWARE SYSTEMS ENGINEERING
Prof. Dr. K. Pohl
Agenda
1. Fundamentals
2. Problem
3. Solution
4. Discussion
SE2023 2

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Fundamentals
(Self-)adaptive Software Systems [Salehie & Tahvildari, 2009; Weyns, 2021]
• Observe changes in environment, requirements and themselves
• Modify their structure, parameters and behavior
Adaptive Software Life-Cycle Model
SE2023 3
DEV OPS
self-observe
ADAPT
self-modify
manual human-in-the-loop autonomous

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Fundamentals
Defining when to adapt
• Requires anticipating all relevant
environment states
• Infeasible in most cases due to incomplete
information at design time
Example:
• Concrete services dynamically bound during
execution not known at design time
Defining how to adapt
• Requires knowing concrete effect of adaptation
• Concrete effect typically not precisely known at
design time
Example:
• Deactivating recommendation engine has positive
impact on performance
• But: How much exactly?
SE2023 4
Engineering Challenge: „Design Time Uncertainty“ [Weyns et al. 2013; Weyns, 2020; Weyns et al., 2022]

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Online Reinforcement Learning
Fundamental RL Model
[Sutton & Barto, 2018]
• Reward function R defines learning goal
• RL agent aims to optimize cumulative rewards
Classical RL (e.g., Q-Learning)
• In each learning step:
either exploitation or exploration
• Exploitation: Reuse of existing knowledge
• Exploration: Collection of new knowledge
Deep RL [Palm et al., 2020]
• Knowledge represented as
Deep Artificial Neuronal Network
• Natural handling of non-stationarity due to
stochastic action selection (sampling)
• Handling of continuous states and actions
• Generalization over unseen, neighbouring states
SE2023 5
Action A
State S
Reward R
Action
Selection
Next state S’
RL Agent
Policy
Policy
Update
Environ-
ment
Textbook example “Cliff Walk“:
Actions =
{UP, DOWN,
LEFT, RIGHT}
Reward
[Sutton & Barto, 2018]
States:

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Online Reinforcement Learning for Adaptive Systems
6
MAPE-K Model Integrated Model
RL Model
Self-Adaptation Logic
Analyze
Monitor Execute
Plan
Knowledge
Online RL for SAS
Execute
Policy
(K)
Monitor
Action
Selection
(A + P)
Policy
Update
Action
A
State S
Reward R
Next state S’
Action A
State S
Reward R
Action
Selection
Next state S’
RL Agent
Policy
Policy
Update
Environ-
ment
Integrating MAPE-K and Reinforcement Learning [Metzger et al., 2022]
SEA4DQ@ESEC

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Agenda
1. Fundamentals
2. Problem
3. Solution
4. Discussion
SE2023 7

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Decision-making of RL agent not transparent
Reward function often too complex to deduce dynamic RL behavior
SE2023 8
Example: Webshop (SEAMS SWIM Exemplar)
• Goal: adapt system to balance different concerns
Example: Prescriptive Business Process Monitoring System
• Goal: find best time to raise alarm for adapting process
Alarm No Alarm

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Practically impossible to deduce decision-making by only observing
“interface”, i.e., R, S, A
SE2023 9
R
S, A
Example:
Webshop

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Knowledge in Deep RL not explicitly represented
SE2023 10
State
S
Action
A
Classical RL (Q-Learning)
Actions A = {UP, DOWN,
LEFT, RIGHT}
Reward
States S = {0, …, 47}
0 11
24 35
24 25 26 27 28 29 30 31 32 33 34 35
UP -13,36 -12,57 -11,73 -10,74 -9,95 -8,91 -7,99 -6,98 -5,95 -4,92 -3,93 -2,98
RIGHT -12,00 -11,00 -10,00 -9,00 -8,00 -7,00 -6,00 -5,00 -4,00 -3,00 -2,00 -1,93
LEFT -12,99 -13,00 -11,98 -10,95 -9,99 -8,89 -7,97 -6,98 -5,94 -4,81 -3,94 -2,98
DOWN -13,95 -112,18 -112,80 -111,49 -112,13 -112,68 -112,91 -112,42 -111,81 -110,62 -112,86 -1,00
Action
A
State S:
Deep RL
Falling into the cliff Reaching the Goal (G)
Example:
Cliff Walk

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Agenda
1. Fundamentals
2. Problem
3. Solution
4. Discussion
SE2023 11

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
XRL-DINE
Reward Decomposition
[Juozapaitis et al., 2019]
Decompose reward function to explain short-
term goal orientation of RL (train sub-RL agents)
Pro
• Helpful in the presence of multiple, “competing”
quality goals for learning
• Provides contrastive (counterfactual) explanations
Con
• No indication of explanation’s relevance
• Requires manually selecting relevant explanations
 cognitive overhead
Interestingness Elements
[Sequeira et al., 2020]
Identify relevant moments of interaction
between agent and environment at runtime
Pro
• Facilitates automatically selecting relevant
interactions to be explained
Con
• Does not explain whether RL behaves as expected
and for the right reasons
12
Provide Insights into Decision Making of Online RL
Decomposed Interestingness Elements (DINEs)
SE2023
+

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
XRL-DINE
Uncertain Action
Whether RL in given state is
uncertain (wide range of actions)
or certain
Reward Channel
Extremum
Points after local minimum /
maximum of state-value indicate
decisions in potentially
critical states
Reward Channel
Dominance
Influence that each sub-agent has
on each possible action
13
Three Types of DINEs
SE2023
Visualization in Dashboard
Certain Actions Uncertain Actions
Action 2 Action 2
Action 4
Action 4
Action 4
Action 2 Action 4
Action 1
Action 2
Action 1 Action 1
time step
Minimum Maximum
time step
Action 1 Action 2 Action 3 Action 4
Reward Channel 1 Reward Channel 2 Reward Channel 3
Action 1 Action 2 Action 3 Action 4
Reward Channel 1 Reward Channel 2 Reward Channel 3
Absolute Reward Channel Dominance Relative Reward Channel Dominance
Action 1 Action 2 Action 3 Action 4 Action 5 Action 1 Action 2 Action 3 Action 4 Action 5

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
XRL-DINE
Application to
SWIM Exemplar
SE2023 14

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
XRL-DINE
SE2023 15
User Study
RQ1: Task performance?
 8 different tasks carried out in the role of a software engineer
RQ2: Technology Acceptance?
 12 questions along Technology Acceptance Model – TAM (5-point scale)
Note, we decided against comparative assessment; would be unfair due to “black-box” nature of Deep RL
E.g., “why was adaptation X
chosen instead of adaptation Y in
state S?”
E.g., “using XRL-DINE in my job
would increase my job
productivity.”

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
XRL-DINE
SE2023 16
User Study – Initial Results
RQ1: Task Performance
 Effectiveness:  0.875
 Efficiency:  0.67
o cf. inspections:  0.07 [Petersen et al., 2008]
o cf. debugging:  0.03 [Ceccato et al., 2015]
17%
31%
31%
21%
Bachelor Student
Master Student
Academia
Industry
0.0 0.5 1.0 1.5
0.0 0.2 0.4 0.6 0.8 1.0
48 Participants

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
XRL-DINE
SE2023 17
User Study – Initial Results
RQ2: Technology Acceptance
 Useful for 61% of participants
 Easy to Use for 55% of participants
4% 4% 6% 2% 6% 4%
17% 19% 15%
10%
10%
6%
21% 23% 23%
27% 13% 21%
46%
48%
44% 48%
48% 48%
13% 6% 13% 13%
23% 21%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Faster Better Increased
Productivity
Increased
Effectiveness
Easier to do
job
useful
Perceived Usefulness
Extremely Unlikely Quite Unlikely Neither Quite Likely Extremely Likely
4% 6% 2% 6% 4%
19% 15%
10%
10%
6%
23% 23%
27% 13% 21%
48%
44% 48%
48% 48%
6% 13% 13%
23% 21%
Better Increased
Productivity
Increased
Effectiveness
Easier to do
job
useful
Perceived Usefulness
ikely Quite Unlikely Neither Quite Likely Extremely Likely
% 6% 10% 6%
15%
6%
% 21% 17% 21%
13% 25%
% 19% 15%
27%
15%
17%
%
40%
33%
35%
48% 40%
% 15%
25%
10% 10% 13%
operate Easy to get the
dashboard to do
what I want
Interaction clear
and
understandable
Visualization
clear and
understandable
Easy to become
skillfil
Easy to Use
Perceived Ease of Use
tremely Unlikely Quite Unlikely Neither Quite Likely Extremely Likely
Useful Easy to Use

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Discussion
A. Metzger, ACSOS 2022 19
Explanation Requirements from Social Sciences [Miller @ Artif. Intell. 2019]
Contrastive: “why P happened instead of Q?”
 “Reward Channel Dominance” DINE
Selective: “no need for complete course of events”
 “Reward Channel Extrema” DINE &
“Uncertain Actions” DINE
Causal: “most likely explanation not
necessarily the best”
 Future work: e.g., check whether agent relies on
spurious correlations (and not causality)
[Gajcin et al. @ AAAMAS Wkshp 2022]
Social: “transfer of knowledge as part of a
conversation”
 Future work: e.g., Chatbots for XAI (à la ChatGPT ;-)


?
? Prediction
Explanation
Train will be
delayed
Train passed
last light 5
min later
than typical
This also
happened
yesterday, but
why today a
problem?
Because of
attribute
“number of
trains behind
current one” > 5
What would
have to change
such that not
delay
(counterfactual)
“number of
trains behind
current one” < 2
Human
Chat4XAI

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Thanks!
SE2023 20
Research leading to these results has received funding from the EU’s
Horizon 2020 research and innovation programme under grant
agreement no. 871493 – www.dataports-project.eu
Track on
Data and AI Driven Engineering
(DAIDE)
As part of SEAA 2023 (49th Euromicro Conference
on Software Engineering and Advanced
Applications)
https://dsd-seaa2023.com/daide/
Deadline: April, 3
Special Issue on
Software Engineering and AI for Data
Quality
ACM Journal of Data and
Information Quality (JDIQ)
https://bit.ly/JDIQ-SI
Deadline: April, 30
Blatant advertising ;-)
Call for Papers

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
References
[Arabnejad et al., 2017] H. Arabnejad, C. Pahl, P. Jamshidi, and G. Estrada, “A comparison of reinforcement learning techniques for
fuzzy cloud autoscaling,” in 17th Intl Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
[De Lemos et al. 2010] R. de Lemos et al., “Software Engineering for Self-Adaptive Systems: A Second Research Roadmap,” in Softw.
Eng. for Self-Adaptive Systems II, ser. LNCS. Springer, 2013, vol. 7475, pp. 1–32
[Di Francescomarino et al. 2018] Chiara Di Francescomarino, Chiara Ghidini, Fabrizio Maria Maggi, Fredrik Milani: Predictive Process
Monitoring Methods: Which One Suits Me Best? BPM 2018: 462-479
[Dulac-Arnold et al. 2015] Gabriel Dulac-Arnold, Richard Evans, Peter Sunehag, Ben Coppin: Reinforcement Learning in Large Discrete
Action Spaces. CoRR abs/1512.07679 (2015)
[Evermann et al. 2017] Evermann, J., Rehse, J., Fettke, P.: Predicting process behaviour using deep learning. Decision Support Systems
100, 2017
[Filho & Porter, 2017] Filho, R.V.R., Porter, B.: Defining emergent software using continuous self-assembly, perception, and learning.
TAAS 12(3), 16:1–16:25 (2017)
[Jamshidi et al., 2015] P. Jamshidi, A. Molzam Sharifloo, C. Pahl, A. Metzger, and G. Estrada, “Self-learning cloud controllers: Fuzzy Q-
learning for knowledge evolution (short paper),” in Int’l Conference on Cloud and Autonomic Computing (IC- CAC 2015) Cambridge,
USA, September 21-24, 2015,
[Kephart & Chess, 2003] J. O. Kephart and D. M. Chess, “The vision of autonomic computing,” IEEE Computer, vol. 36, no. 1, pp. 41–50,
2003.
[Klein et al. 2014] C. Klein, M. Maggio, K. Arzen, F. Hernandez-Rodriguez, “Brownout: building more robust cloud applications”. In:
36th Intl Conf. on Software Engineering (ICSE 2014), pp. 700–711. ACM, 2014
[Mann, 2016] Z. Mann, “Interplay of virtual machine selection and virtual machine placement”, in: 5th European Conf. on Service-
Oriented and Cloud Computing, ESOCC’16, LNCS vol. 9846, pp. 137–151 (2016)
[Metzger & Pohl, 2014] A. Metzger, K. Pohl, “Software product line engineering and variability management: Achievements and
challenges,” in ICSE Future of Software Engineering Track (FOSE 2014), ACM, 2014, pp. 70–84.
21
SE2023

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
References
[Metzger et al. 2019] A. Metzger, A. Neubauer, P. Bohn, and K. Pohl, “Proactive process adaptation using deep learning ensembles,” in
31st Int’l Conf. on Advanced Information Systems Engineering (CAiSE 2019), LNCS, vol. 11483. Springer, 2019, pp. 547–562
[Metzger et al. 2020] A. Metzger, C. Quinton, Z. Á. Mann, L. Baresi, K. Pohl, “Realizing Self-Adaptive Systems via Online Reinforcement
Learning and Feature-Model-guided Exploration”, Computing, Springer, March, 2022
[Metzger et al. 2020a] A. Metzger, C. Quinton, Z. Mann, L. Baresi, and K. Pohl, “Feature model-guided online reinforcement learning
for self-adaptive services,” in 18th Int’l Conf. on Service-Oriented Computing (ICSOC 2020), LNCS 12571, Springer, 2020
[Metzger et al. 2020b] A. Metzger, T. Kley, and A. Palm, “Triggering proactive business process adaptations via online reinforcement
learning,” in 18th Int’l Conf. on Business Process Management (BPM 2020), LNCS 12168. Springer, 2020e
[Palm et al. 2020] A. Palm, A. Metzger, and K. Pohl, “Online reinforcement learning for self-adaptive information systems,” in 32nd Int’l
Conf. on Advanced Information Systems Engineering (CAiSE 2020), LNCS 12127. Springer, 2020
[Porter et al., 2020] B. Porter, R. R. Filho, and P. Dean, “A survey of methodology in self-adaptive systems research,” in ACSOS. IEEE,
2020
[Salehie & Tahvildari, 2009] M. Salehie and L. Tahvildari, “Self-adaptive software: Landscape and research challenges,” TAAS, vol. 4, no.
2, 2009.
22
SE2023

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
References
[Siegmund et al. 2012] N. Siegmund, S. Kolesnikov, C. Kästner, S. Apel, D. Batory, M. Rosenmüller, G. Saake, G.: Predicting Performance
via Automated Feature-interaction Detection. In: 34th Intl Conf. on Software Engineering (ICSE 2012), pp. 167–177, ACM, 2012
[Sutton & Barto, 2018] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press,
2018
[Taylor & Stone, 2009] M. Taylor, P. Stone: Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res. 10,
1633–1685 (2009)
[Wang et al., 2020] Hongbing Wang, Jiajie Li, Qi Yu, Tianjing Hong, Jia Yan, Wei Zhao: Integrating recurrent neural networks and
reinforcement learning for dynamic service composition. Future Gener. Comput. Syst. 107: 551-563 (2020)
[Weyns et al. 2013] Danny Weyns, Nelly Bencomo, Radu Calinescu, Javier Cámara, Carlo Ghezzi, Vincenzo Grassi, Lars Grunske, Paola
Inverardi, Jean-Marc Jézéquel, Sam Malek, Raffaela Mirandola, Marco Mori, Giordano Tamburrelli: Perpetual Assurances for Self-
Adaptive Systems. Software Engineering for Self-Adaptive Systems 2013: 31-63
[Weyns, 2021] Danny Weyns, Introduction to Self-Adaptive Systems: A Contemporary Software Engineering Perspective, Wiley, 2021.
[Weyns et al., 2022] D. Weyns, I. Gerostathopoulos, N. Abbas, J. Andersson, S. Biffl, P. Brada, T. Bures, A. D. Salle, P. Lago, A. Musil, J.
Musil, and P. Pelliccione, “Preliminary results of a survey on the use of self-adaptation in industry,” in 17th Intl Symp. on Software
Engineering for Adaptive and Self-Managing Systems, SEAMS@ICSE 2022, 2022.
[Xu et al., 2012] C. Xu, J. Rao, and X. Bu, “URL: A unified reinforcement learning approach for autonomic cloud management,” J.
Parallel Distrib. Comput., vol. 72, no. 2, pp. 95–105, 2012
23
SE2023

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Fundamentals
MAPE-K Reference Model [Kephart & Chess, 2003; Salehie & Tahvildari, 2009]
• Architecture as defining characteristic of adaptive system [Weyns, 2020]
SE2023 24
Example: Adaptive Web Shop
• Monitor: Sudden increase in number of concurrent users (workload peak)
• Analyze: Slow response of web shop if no adaptation
• Plan: Deactivating optional recommendation feature
• Execute: Replace dynamic recommendations with static banner
Analyze
Monitor Execute
Plan
Knowledge
Identify concrete
adaptations
Enact concrete
adaptation
Determine need for
adaptation
Collect and aggregate
observations
System Logic
Sensors Effectors
0
e
+
0
0
1
e
+
0
5
2
e
+
0
5
3
e
+
0
5
4
e
+
0
5
5
e
+
0
5
6
e
+
0
5
1
0
0
1
5
0
2
0
0
2
5
0
d
$
w
o
r
k
l
o
a
d
Workload
t

©
SSE,
Prof.
Dr.
Andreas
Metzger
Prof. Dr. K. Pohl
Online Reinforcement Learning
25
MAPE-K Model Integrated Model
RL Model
Analyze
Monitor Execute
Plan
Knowledge
Online RL for SAS
Execute
Policy
(K)
Monitor
Action
Selection
(A + P)
Policy
Update
Action
A
State S
Reward R
Next state S’
Action A
State S
Reward R
Action
Selection
Next state S’
RL Agent
Policy
Policy
Update
Environ-
ment
Integrating MAPE-K and Reinforcement Learning [Metzger et al., 2022]
SE2023

Explainable Online Reinforcement Learning for Adaptive Systems

Recommended

Recommended

More Related Content

Similar to Explainable Online Reinforcement Learning for Adaptive Systems

Similar to Explainable Online Reinforcement Learning for Adaptive Systems (20)

More from Andreas Metzger

More from Andreas Metzger (14)

Recently uploaded

Recently uploaded (20)

Explainable Online Reinforcement Learning for Adaptive Systems

Editor's Notes