Presented by Pengyuan (Eric) Lu at the International Conference on Assured Autonomy 2023.
Abstract: Models of actual causality leverage domain knowledge to generate convincing diagnoses of events that caused an outcome. It is promising to apply these models to diagnose and repair run-time property violations in cyber-physical systems (CPS) with learning-enabled components (LEC). However, given the high diversity and complexity of LECs, it is challenging to encode domain knowledge (e.g., the CPS dynamics) in a scalable actual causality model that could generate useful repair suggestions. In this paper, we focus causal diagnosis on the input/output behaviors of LECs. Specifically, we aim to identify which subset of I/O behaviors of the LEC is an actual cause for a property violation. An important by-product is a counterfactual version of the LEC that repairs the run-time property by fixing the identified problematic behaviors. Based on this insight, we design a two-step diagnostic pipeline: (1) construct a Halpern-Pearl causality model that reflects the dependency of property outcome on the component's I/O behaviors, and (2) perform a search for an actual cause and corresponding repair on the model. We prove that our pipeline has the following guarantee: if an actual cause is found, the system is guaranteed to be repaired; otherwise, we have high probabilistic confidence that the LEC under analysis did not cause the property violation. We demonstrate that our approach successfully repairs learned controllers on a standard OpenAI Gym benchmark.
Causal Repair of Learning-Enabled Cyber-physical Systems
1. Causal Repair of Learning-enabled Cyber-physical Systems
Pengyuan (Eric) Lu*, Ivan Ruchkin+, Matthew Cleaveland*, Oleg Sokolsky* and Insup Lee*
*PRECISE Center, University of Pennsylvania
+Trustworthy Engineered Autonomy Lab, University of Florida
The 2nd International Conference on Assured Autonomy
June 6th, 2023
2. Outline
1. Motivation
2. Background
3. Problem statement
4. Solution Part I: Constructing Halpern-Pearl Model
5. Solution Part II: Searching for Repair
6. Experiment results
4. Motivation: What caused the failure?
Source: Nando’s Giphy page [link]
Source: Wall Street Journal [link]
5. Motivation: Repair the Controller
● Failures can be formalized as violations of specifications at runtime
● E.g. signal temporal logics (STL)
● Repair: failure → success
6. Statistical/non-causal Repair
Moosbrugger et al., 2017:
● Runtime observation ⇒ statistical analysis ⇒ diagnosis and repair suggestion
But correlation ≠ causation!
Observation:
● No coat, 90% exam score
● Coat, 100% exam score
“I got perfect scores because I wear this coat!”
7. Causal Diagnosis and Repair
Ibrahim et al., 2019:
[Figure: causal graph with nodes “Studying hard”, “Going to office hours” and “Perfect scores”]
8. The Problem of Learning-enabled CPS
● They do not follow a “standard” internal structure!
Deep Q Network
Source: Renu Khandelwal [link]
9. Motivation: Diagnosis and Repair on I/Os
● We suspect the I/Os of a component, e.g. controller of this mountain car
● A cause is a factual assignment of output values to a subset of input values
○ “It is because x1 is mapped to y1, x2 is mapped to y2, …, the CPS failed.”
● A repair is a reassignment of output values to these input values
○ “Had we mapped x1 to y1’, x2 to y2’, …, the CPS would have succeeded.”
Denote the factual behavior as a mapping
ctrl in     ctrl out
(0, 0)      0.2
(0, 0.1)    0.21
…           …
10. Challenges / Contributions
● Constructing a causal model to encode the dependency of an STL outcome on a suspected controller’s I/O behaviors
○ Feasible for efficient search for the cause and repair
○ Repair on a cause must be a minimal repair
● Search algorithm for a cause and corresponding repair on the model
○ What to do if our suspicion is wrong?
12. Halpern-Pearl Causality Model (HP Model)
● Endogenous nodes: candidates to be blamed
● Exogenous nodes: a fixed context, not to be blamed
● Value space of all nodes
● Dependency equations
[Figure: example model. Endogenous: Studying hard (SH) = 1, Going to office hours (GtOH) = 0, Perfect scores (PS) = 0. Exogenous: Alice is a student = 1, Avg sleeping hours = 7, Credits taken this semester = 10. Equations: PS = SH ∧ GtOH, SH = …, GtOH = …]
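The student example can be written down as a small program. The following is only a minimal sketch, not the authors' implementation: the class name `HPModel` and the method `evaluate` are hypothetical, and because the equations for SH and GtOH are elided on the slide, the sketch sets those two nodes directly instead of computing them from the context.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Assignment = Dict[str, int]


@dataclass
class HPModel:
    """Hypothetical container for a tiny Halpern-Pearl style model."""
    exogenous: Assignment                               # fixed context, never blamed
    equations: Dict[str, Callable[[Assignment], int]]   # dependency equations

    def evaluate(self, interventions: Assignment) -> Assignment:
        """Set some endogenous nodes by hand and compute the remaining ones."""
        values = {**self.exogenous, **interventions}
        for node, equation in self.equations.items():
            if node not in interventions:
                values[node] = equation(values)
        return values


model = HPModel(
    exogenous={"is_student": 1, "avg_sleeping_hours": 7, "credits": 10},
    # Only PS = SH AND GtOH is given on the slide; SH and GtOH are assumed
    # to be set directly because their equations are elided.
    equations={"PS": lambda v: v["SH"] & v["GtOH"]},
)

factual = model.evaluate({"SH": 1, "GtOH": 0})
repaired = model.evaluate({"SH": 1, "GtOH": 1})
print("factual PS:", factual["PS"], "| repaired PS:", repaired["PS"])
```

Keeping the exogenous context in a separate field mirrors the split on the slide: only the endogenous nodes are ever reassigned.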
13. Halpern-Pearl Causality Model (HP Model)
● An outcome is a proposition on endogenous nodes’ value assignments
○ E.g. perfect score := 1
● A candidate cause (of an outcome) is a conjunction of endogenous nodes’ value assignments
○ E.g. (studying hard := 1) ∧ (going to office hours := 0)
○ E.g. going to office hours := 0
○ Can be written in vector form
14. Halpern-Pearl Causality Model (HP Model)
● Repair: Reassignment of a subset of endogenous nodes to change the outcome
● Here, cause = (GtOH := 0), repair = (GtOH := 1)
[Figure: example model after the repair. Endogenous: Studying hard (SH) = 1, Going to office hours (GtOH) = 1, Perfect scores (PS) = 1. Exogenous: Alice is a student = 1, Avg sleeping hours = 7, Credits taken this semester = 10. Equations: PS = SH ∧ GtOH, SH = …, GtOH = …]
15. Three Criteria for a Cause (Informal)
What makes (going to office hours := 0) a cause of (perfect scores := 0), but not (studying hard := 1) ∧ (going to office hours := 0)?
● AC1: “Bad” outcome must be factual ✅
● AC2: Exists “good” reassignment of the candidate causes to make the outcome “good” ✅
● AC3: Candidate cause must be minimal
○ No proper subset of the candidate cause satisfies AC2!
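The informal criteria can be checked by brute force on the student example. This is a sketch under simplifying assumptions: it covers only the informal AC1 to AC3 as worded on this slide (no circumstance partition), it treats SH and GtOH as directly settable binary nodes, and the function names are made up for illustration.

```python
from itertools import combinations, product

FACTUAL = {"SH": 1, "GtOH": 0}


def outcome(assignment):
    """PS = SH AND GtOH; 1 is the 'good' outcome, 0 the 'bad' one."""
    return assignment["SH"] & assignment["GtOH"]


def ac2(candidate):
    """Informal AC2: some reassignment of the candidate nodes makes the outcome good."""
    nodes = sorted(candidate)
    for values in product([0, 1], repeat=len(nodes)):
        trial = {**FACTUAL, **dict(zip(nodes, values))}
        if outcome(trial) == 1:
            return True
    return False


def is_cause(candidate):
    """Informal AC1 to AC3 for a set of suspected endogenous nodes."""
    ac1 = outcome(FACTUAL) == 0                      # the bad outcome is factual
    proper_subsets = (
        set(sub)
        for size in range(1, len(candidate))
        for sub in combinations(sorted(candidate), size)
    )
    ac3 = not any(ac2(sub) for sub in proper_subsets)  # minimality
    return ac1 and ac2(candidate) and ac3


for suspects in ({"GtOH"}, {"SH", "GtOH"}, {"SH"}):
    print(sorted(suspects), "->", is_cause(suspects))
```

Under these assumptions, {GtOH} passes all three checks, {SH, GtOH} fails minimality because {GtOH} alone already satisfies AC2, and {SH} fails AC2.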
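16. Three Criteria for a Cause (Formal)
Based on a factual value assignment, what makes a candidate a cause of an outcome?
● AC1: the outcome must be true under the factual assignment
● AC2: there exists a partition of the endogenous nodes into (candidate, circumstance, others) such that
○ AC2(a): there exist counterfactual values for the candidate and the circumstance that flip the outcome
○ AC2(b): if we fix the candidate at its factual values and only change the circumstance, then no matter how we change the values of any subset of the others, the outcome stays at its factual value
● AC3: no proper subset of the candidate satisfies AC1 and AC2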
18. Problem Statement
On an observed trace with a violation of property φ, how can we use an HP causality model to identify a subset of a suspected component C’s I/O behaviors that caused the violation, and provide an alternative behavior as a minimal repair?
● Sub-problem 1: Encode the dependency of the outcome of φ on the behaviors of C as an HP model
● Sub-problem 2: On the constructed HP model, design a search algorithm for a cause that
○ Upon success, returns a corresponding repair
○ Upon failure, quantifies the confidence that the violation is not caused by C
Assume: (1) robust outcome and a simulator
(2) Lipschitz-continuous
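19. Proposed Solution
● Step 1: HP model construction
[Figure: Component C with behavior f, ENCODE and DECODE between f and the binary nodes u111, u112, u121, u122, u211, u212, u221, u222, and SIMULATE producing the outcome o of φ. Node values: u111 = 1, u112 = 0, u121 = 1, u122 = 0, u211 = 1, u212 = 0, u221 = 0, u222 = 0; outcome of φ = 0]
● Step 2: Search for cause and repair on the HP model
[Figure: same diagram with node values u111 = 1, u112 = 1, u121 = 1, u122 = 0, u211 = 1, u212 = 0, u221 = 1, u222 = 0; outcome of φ = 1]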
22. Issues of Infinite HP Model
● Infinitely many endogenous nodes!
○ Infinitely large search space for counterfactual value assignments
● Can we shrink down the search space?
26. Issues of Discretized HP Model
● Recall AC3 of HP causality
○ Cause is found by the counterfactual node value assignment that flips
the outcome and minimally disagrees with the factual one
○ Minimality: no proper subset of the disagreeing nodes can repair the
outcome
● This needs to reflect a minimal repair
27. Goal: Partial Order Preservation
Only the value at one node changes, but we need set containment
28. Goal: Partial Order Preservation
● Partial order on behaviors
○ For two behaviors, is one closer to the factual behavior than the other?
● Partial order on endogenous node value assignments
○ For two assignments, are the nodes where the first disagrees with the factual assignment a subset of the nodes where the second disagrees with it?
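The partial order on node value assignments can be expressed as set containment of disagreeing nodes. A minimal sketch follows; the node names and the function names `disagreeing_nodes` and `at_least_as_close` are illustrative, not taken from the talk.

```python
def disagreeing_nodes(assignment, factual):
    """Nodes whose value differs from the factual assignment."""
    return {node for node in factual if assignment[node] != factual[node]}


def at_least_as_close(a, b, factual):
    """True if a's disagreements with the factual are a subset of b's."""
    return disagreeing_nodes(a, factual) <= disagreeing_nodes(b, factual)


factual = {"u1": 1, "u2": 0, "u3": 0}
a = {"u1": 1, "u2": 1, "u3": 0}   # disagrees on u2
b = {"u1": 1, "u2": 1, "u3": 1}   # disagrees on u2 and u3
c = {"u1": 0, "u2": 0, "u3": 0}   # disagrees on u1

print(at_least_as_close(a, b, factual))  # a is below b in the order
print(at_least_as_close(a, c, factual), at_least_as_close(c, a, factual))
```

Because the relation is subset inclusion, some pairs (here a and c) are incomparable, which is why this is a partial rather than a total order.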
29. Encoding of Behaviors
[Figure: input axis with ticks 0, 1, 2; output space with axes dim 1 (ticks 5, 6, 7) and dim 2 (ticks 10, 11, 12)]
In        Out
[0, 1]    [6, 7] × [10, 11]
[1, 2]    [5, 6] × [10, 11]
30. Encoding of Behaviors
Encoded proposition:
In        Out dim1    Out dim2    Proposition
[0, 1]    ≥ 5         ≥ 10        1
[0, 1]    ≥ 5         ≥ 11        0
[0, 1]    ≥ 6         ≥ 10        1
[0, 1]    ≥ 6         ≥ 11        0
…         …           …           …
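One reading that reproduces the four table rows shown for input cell [0, 1] is sketched below. It assumes, and this is only an assumption inferred from those rows, that a proposition is 1 exactly when the lower corner of the output box meets every threshold. The function name `encode_output_box` is hypothetical.

```python
from itertools import product


def encode_output_box(out_box, thresholds):
    """Map an output box to threshold propositions (assumed semantics).

    out_box:    one (low, high) interval per output dimension
    thresholds: candidate thresholds per output dimension
    """
    lower_corner = [low for low, _high in out_box]
    propositions = {}
    for combo in product(*thresholds):
        holds = all(low >= t for low, t in zip(lower_corner, combo))
        propositions[combo] = int(holds)
    return propositions


# Input cell [0, 1] is mapped to the output box [6, 7] x [10, 11].
encoded = encode_output_box([(6, 7), (10, 11)], [[5, 6], [10, 11]])
for (t1, t2), value in encoded.items():
    print(f"In [0, 1]: out dim1 >= {t1}, out dim2 >= {t2} -> {value}")
```

With such an encoding, every behavior becomes a vector of binary node values, which is what the search on the HP model operates on.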
34. Random Sampling for a “Good” Counterfactual
● On the propositional HP model, we sample (allowed) node value assignments uniformly at random
● There is a chance that we are suspecting the wrong component
○ The component’s behaviors are not the cause, or not the only cause, of the failure
● The more assignments we sample without repairing the outcome, the more confident we are that our suspicion is wrong
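The sampling loop might look roughly like the sketch below. It is an illustration only: `outcome` stands in for decoding an assignment and simulating the system, the toy outcome functions are invented, and the sketch samples every binary assignment without modeling which assignments are “allowed”.

```python
import random


def sample_good_counterfactual(nodes, outcome, max_samples, rng):
    """Draw uniform random binary assignments until one repairs the outcome.

    Returns (assignment, number_of_draws), or (None, max_samples) if no
    draw flipped the outcome to 1.
    """
    for drawn in range(1, max_samples + 1):
        candidate = {node: rng.randint(0, 1) for node in nodes}
        if outcome(candidate) == 1:
            return candidate, drawn
    return None, max_samples


nodes = ["u1", "u2", "u3"]
rng = random.Random(0)  # seeded only to make the demo repeatable

# Toy outcome in which the suspected component can repair the failure.
found, draws = sample_good_counterfactual(
    nodes, lambda a: a["u1"] & a["u2"], max_samples=200, rng=rng
)
# Toy outcome in which no assignment helps, i.e. the suspicion is wrong.
missed, tried = sample_good_counterfactual(
    nodes, lambda a: 0, max_samples=50, rng=rng
)
print(found, draws, missed, tried)
```

A long run of failed draws, as in the second call, is the evidence that is turned into a confidence statement.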
35. How Many Samples?
Wilson 1927:
● Parameters: a portion threshold, a significance level, and the corresponding quantile of the standard Normal distribution
● If we uniformly sample enough assignments in a row without flipping the outcome, then we are confident, at the chosen significance level, that the portion of “good” counterfactuals is below the threshold
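For intuition, the Wilson score interval with zero “good” draws out of n has upper bound z² / (n + z²), where z is a standard Normal quantile. The sketch below solves this for n. The symbols `epsilon` (portion threshold) and `alpha` (significance level), and the use of the one-sided quantile at 1 − alpha, are assumptions of this sketch; the exact formula on the slide may differ.

```python
import math
from statistics import NormalDist


def wilson_upper_bound_zero_successes(n, alpha):
    """Wilson score upper bound on the success portion after n failures in a row."""
    z = NormalDist().inv_cdf(1 - alpha)   # assumed one-sided quantile
    return z * z / (n + z * z)


def samples_needed(epsilon, alpha):
    """Smallest n whose Wilson upper bound is at most epsilon."""
    z = NormalDist().inv_cdf(1 - alpha)
    return math.ceil(z * z * (1 - epsilon) / epsilon)


n = samples_needed(epsilon=0.05, alpha=0.05)
print(n, wilson_upper_bound_zero_successes(n, alpha=0.05))
```

Smaller thresholds or stricter significance levels both increase the number of consecutive failed samples required.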
36. Interpolation for a Cause
● After sampling, we have only found a “good” counterfactual behavior
● To find the cause, we need a counterfactual that is minimally different from the factual and can repair the outcome
● There must exist such an assignment between the factual assignment and the sampled “good” one
38. Interpolation for a Cause
[Figure sequence, slides 38 to 43: behaviors in between the factual “bad” behavior and the sampled “good” behavior are tested one after another (✅, ✅, ❌, ✅, ❌), ending at the final interpolated behavior]
44. Interpolation for a Cause
● We step from the sampled “good” assignment towards the factual one until it can no longer repair the outcome
● Theorem: the interpolated assignment is a repair, and the difference between its node assignments and the factual ones is a cause based on the HP model
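A greedy version of the stepping idea is sketched below. It is not the authors' algorithm and does not by itself establish the guarantee in the theorem: it reverts one disagreeing node at a time to its factual value and keeps the step only if the outcome stays repaired. The name `interpolate`, the node names and the toy outcome are made up for illustration.

```python
def interpolate(factual, good, outcome):
    """Move a repairing assignment towards the factual one, node by node."""
    current = dict(good)
    for node in factual:
        if current[node] == factual[node]:
            continue
        trial = {**current, node: factual[node]}
        if outcome(trial) == 1:        # still repaired, so keep the step
            current = trial
    # Nodes that still disagree: factual values form the cause,
    # the new values form the repair.
    cause = {n: factual[n] for n in factual if current[n] != factual[n]}
    repair = {n: current[n] for n in cause}
    return current, cause, repair


factual = {"u1": 1, "u2": 0, "u3": 0}
good = {"u1": 0, "u2": 1, "u3": 1}      # a sampled "good" assignment


def toy_outcome(assignment):
    """Toy stand-in for decode + simulate: only u2 matters here."""
    return assignment["u2"]


interpolated, cause, repair = interpolate(factual, good, toy_outcome)
print(interpolated, cause, repair)
```

In this toy run, the steps on u1 and u3 are accepted and the step on u2 is rejected, so only u2 remains different from the factual assignment.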
46. OpenAI Gym Mountain Car + DNN Controllers
● Input space of controller:
● Output space of controller:
deep neural network
47. Results
● Input space is discretized into 18 × 14 = 252 cells
● 153 out of the 252 cells are mapped to a different value
● Repair = “had these 153 cells been mapped to the new values, the mountain car would have succeeded from the initial state (-0.5, 0)”
[Figure panels: Factual Controller, Searched Controller, Interpolated/repaired Controller]
49. Conclusion
● We designed a causal diagnosis and repair algorithm for learning-enabled CPS
● The algorithm first constructs a Halpern-Pearl model, and then
○ Finds a cause and a corresponding repair on a component’s I/Os, or
○ Exits with a quantified confidence that the component’s I/Os are not to be blamed
● We experimented with OpenAI Gym Mountain Car and successfully repaired a DNN controller
50. Future Work
● From I/O to controller parameters
○ How does the repaired I/O help modify the neural network weights?
● Repairing from different initial states
○ How to repair one initial state without breaking others?
● Ongoing work: gradient-based repair by barrier methods
51. Acknowledgement
● We appreciate the support by ARO (W911NF-20-1-0080), AFRL and DARPA
(FA8750-18-C-0090).
● Any opinions expressed are those of the authors and do not necessarily
reflect the views of ARO, AFRL, DARPA, DoD or the United States
Government.