SlideShare a Scribd company logo
1 of 60
Download to read offline
1/42
Intrusion Tolerance for Networked Systems
Through Two-Level Feedback Control
NSE Seminar
Kim Hammar
kimham@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
Nov 10, 2023
2/42
Use Case: Intrusion Tolerance
. . .
Clients
api gateways
Compute nodes
Storage nodes
Service
replica 1
Service
replica 2
Service
replica 3
Service
replica 4
Client interface & load balancer
▶ A replicated system offers services to a client population.
▶ The system must be highly available and provide correct
service without disruption.
2/42
Use Case: Intrusion Tolerance
. . .
Attacker Clients
api gateways
Compute nodes
Storage nodes
Service
replica 1
Service
replica 2
Service
replica 3
Service
replica 4
Client interface & load balancer
▶ An attacker seeks to intrude on the system and disrupt service
▶ The system should tolerate intrusions
▶ it should provide correct service
even if a fraction of replicas are compromised
2/42
Examples of Systems that Need to Tolerate Intrusions
. . .
Control planes Embedded systems
SCADA systems Payment systems
3/42
Intrusion Tolerance (Simplified)
Intrusion event Time of full recovery
Time
Recovery time
Survivability
Loss
Normal
performance
System
performance
Tolerance
Cumulative
performance loss
(want to minimize)
3/42
Intrusion-Tolerant Systems - State of The Art
▶ State-of-the-art intrusion-tolerant
systems involve 3 building blocks:
1. a protocol for service replication
2. a scaling strategy
3. a recovery strategy
▶ Given N replicas, the system provides
correct service with up to f = N−1
3
compromised replicas.
▶ Theoretical upper bound
▶ f is the tolerance threshold.
▶ Simple control strategies:
▶ Fixed number of replicas (no scaling)
▶ No recovery or periodic recovery
. . .
Clients
Service
replica 1
Service
replica 2
Service
replica 3
Service
replica 4
Client interface & load balancer
3/42
Intrusion-Tolerant Systems - State of The Art
▶ State-of-the-art intrusion-tolerant
systems involve 3 building blocks:
1. a protocol for service replication
2. a scaling strategy
3. a recovery strategy
▶ Given N replicas, the system provides
correct service with up to f = N−1
3
compromised replicas.
▶ Theoretical upper bound
▶ f is the tolerance threshold.
▶ Simple control strategies:
▶ Fixed number of replicas (no scaling)
▶ No recovery or periodic recovery
. . .
Clients
Service
replica 1
Service
replica 2
Service
replica 3
Service
replica 4
Client interface & load balancer
This work: Optimal intrusion recovery and scaling strategies
4/42
Can we use decision theory and learning-based methods to
automatically find effective security strategies?
Intrusion prevention
Simulation.
Small-scale. (2020)1.
Intrusion prevention
Optimal stopping.
Emulation, small-scale
Static attacker. (2021)2.
Intrusion response
Optimal multiple stopping.
Emulation, small-scale.
Static attacker. (2022)3
Intrusion response
Dynkin game.
Emulation, small-scale.
Dynamic attacker. (2022)4
Intrusion response
Decomposition.
Emulation, large-scale
Dynamic attacker. (2023)5
Intrusion tolerance
Two-level control.
Integration with bft.
Static attacker.
(This work)
1
Kim Hammar and Rolf Stadler. “Finding Effective Security Strategies through Reinforcement Learning and
Self-Play”. In: International Conference on Network and Service Management (CNSM 2020). Izmir, Turkey, 2020.
2
Kim Hammar and Rolf Stadler. “Learning Intrusion Prevention Policies through Optimal Stopping”. In:
International Conference on Network and Service Management (CNSM 2021). Izmir, Turkey, 2021.
3
Kim Hammar and Rolf Stadler. “Intrusion Prevention Through Optimal Stopping”. In: IEEE Transactions on
Network and Service Management 19.3 (2022), pp. 2333–2348. doi: 10.1109/TNSM.2022.3176781.
4
Kim Hammar and Rolf Stadler. “Learning Near-Optimal Intrusion Responses Against Dynamic Attackers”. In:
IEEE Transactions on Network and Service Management (2023), pp. 1–1. doi: 10.1109/TNSM.2023.3293413.
5
Kim Hammar and Rolf Stadler. “Scalable Learning of Intrusion Response through Recursive Decomposition”.
In: 14th International Conference on Decision and Game Theory for Security. Avignon, France, 2023.
5/42
Our Framework for Automated Security
s1,1 s1,2 s1,3 . . . s1,n
s2,1 s2,2 s2,3 . . . s2,n
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Emulation System
Target
Infrastructure
Model Creation &
System Identification
Strategy Mapping
π
Selective
Replication
Strategy
Implementation π
Simulation System
Learning &
Optimization
Strategy evaluation &
Model estimation
Automation &
Self-learning systems
▶ Source code: https://github.com/Limmen/csle
▶ Documentation: http://limmen.dev/csle/
▶ Demo:
https://www.youtube.com/watch?v=iE2KPmtIs2A&
6/42
Outline
▶ Use Case & Research Problem
▶ Use case: intrusion tolerance
▶ Goal: optimal control strategies for intrusion-tolerant systems
▶ Background
▶ Fault tolerance and intrusion tolerance
▶ State machine replication
▶ Our Contributions
▶ The tolerance control architecture
▶ Constrained two-level control problem
▶ Theoretical results
▶ Computational algorithms
▶ Comparison with State-of-the-art
▶ Implementation and evaluation
▶ Conclusions
6/42
Outline
▶ Use Case & Research Problem
▶ Use case: intrusion tolerance
▶ Goal: optimal control strategies for intrusion-tolerant systems
▶ Background
▶ Fault tolerance and intrusion tolerance
▶ State machine replication
▶ Our Contributions
▶ The tolerance control architecture
▶ Constrained two-level control problem
▶ Theoretical results
▶ Computational algorithms
▶ Comparison with State-of-the-art
▶ Implementation and evaluation
▶ Conclusions
6/42
Outline
▶ Use Case & Research Problem
▶ Use case: intrusion tolerance
▶ Goal: optimal control strategies for intrusion-tolerant systems
▶ Background
▶ Fault tolerance and intrusion tolerance
▶ State machine replication
▶ Our Contributions
▶ The tolerance control architecture
▶ Constrained two-level control problem
▶ Theoretical results
▶ Computational algorithms
▶ Comparison with State-of-the-art
▶ Implementation and evaluation
▶ Conclusions
6/42
Outline
▶ Use Case & Research Problem
▶ Use case: intrusion tolerance
▶ Goal: optimal control strategies for intrusion-tolerant systems
▶ Background
▶ Fault tolerance and intrusion tolerance
▶ State machine replication
▶ Our Contributions
▶ The tolerance control architecture
▶ Constrained two-level control problem
▶ Theoretical results
▶ Computational algorithms
▶ Comparison with State-of-the-art
▶ Implementation and evaluation
▶ Conclusions
6/42
Outline
▶ Use Case & Research Problem
▶ Use case: intrusion tolerance
▶ Goal: optimal control strategies for intrusion-tolerant systems
▶ Background
▶ Fault tolerance and intrusion tolerance
▶ State machine replication
▶ Our Contributions
▶ The tolerance control architecture
▶ Constrained two-level control problem
▶ Theoretical results
▶ Computational algorithms
▶ Comparison with State-of-the-art
▶ Implementation and evaluation
▶ Conclusions
7/42
Background: Fault Tolerance and Intrusion Tolerance
▶ Seminal work made by von Neumann and
Shannon in 1956.
▶ Initially focused on building
fault-tolerant circuits.
▶ Fault tolerance includes tolerance
against: software bugs, malicious
attacks, operator mistakes, etc.
▶ Key strategy for fault tolerance:
redundancy
▶ Redundancy is achieved through service
replication. Replicas are coordinated
through a consensus protocol.
. . .
Replicated system
Client interface
Request
Response
Consensus protocol
1 2 3 4 5
8/42
Background: Consensus
▶ Consensus is the problem of reaching agreement on a single
value among a set of distributed nodes subject to failures.
▶ Fascinating problem for many reasons:
▶ Key problem to build practical systems
▶ The problem comes in many flavors
▶ Paradoxical cases
▶ Rich theory
▶ Turing awardees working on consensus:
Dijkstra, 1972 Gray, 1998 Liskov, 2008 Lamport, 2013
9/42
Background: Consensus
Definition (Consensus)
We have N nodes indexed by 1, . . . , N. The nodes are connected
by a complete graph and communicate via message passing. Each
node starts with an input value v ∈ V. The goal is to agree on a
single value in V.
An algorithm A solves consensus if the following hold:
1. Agreement: No two correct nodes decide on different values
2. Termination: All nodes eventually decide.
3. Validity: If all nodes start with input v then they decide on v
10/42
Consensus Example: The Two Generals Problem (Gray ’78)
▶ Two generals are planning a coordinated attack from different
directions.
▶ One of the simplest consensus problems:
▶ Only two nodes: 1 and 2
▶ No process failures but link failures may occur
▶ Is the problem solvable?
10/42
Consensus Example: The Two Generals Problem (Gray ’78)
▶ Two generals are planning a coordinated attack from different
directions.
▶ One of the simplest consensus problems:
▶ Only two nodes: 1 and 2
▶ No process failures but link failures may occur
▶ Is the problem solvable? No! Can be proven by
contradiction.
11/42
When Is Consensus Solvable?
▶ Solvability depends on synchrony and failure assumptions.
▶ Three synchronicity models:
▶ The asynchronous model: no bounds on either delays or
clock drifts.
▶ The partially synchronous model: an upper bound exists
but the system may have periods of instability where the upper
bound does not hold.
▶ The synchronous model: there is an upper bound on the
communication delay and clock drift between any two nodes.
. . .
Synchronous system Asynchronous system Partially synchronous system
Internet
Organization WAN
12/42
When Is Consensus Solvable?
▶ Three main failure models:
▶ Crash-stop: nodes fail by crashing.
▶ Byzantine: failed nodes may behave arbitrarily (e.g., be
controlled by an attacker)
▶ Hybrid: Byzantine failures but each node is equipped with a
trusted component that only fails by crashing.
Byzantine failure
Crash-stop failure
Virtual machine
Hybrid failure
13/42
When Is Consensus Solvable?
Theorem (Summary of 40 years of research)
▶ Consensus is not solvable in the asynchronous model
▶ Consensus is solvable with a reliable network in the partially
synchronous model with N nodes and up to
▶ f = N−1
2 Crash-stop failures
▶ f = N−1
3 Byzantine failures
▶ f = N−1
2 Hybrid failures (assuming authenticated
channels)
▶ Consensus is solvable with a reliable network in the
synchronous model with N nodes and up to
▶ f = N − 1 Crash-stop failures
▶ f = N−1
2 Byzantine failures (assuming authenticated channels)
▶ f = N−1
2 Hybrid failures (assuming authenticated channels)
14/42
Outline
▶ Use Case & Research Problem
▶ Use case: intrusion tolerance
▶ Goal: optimal control strategies for intrusion-tolerant systems
▶ Background
▶ Fault tolerance and intrusion tolerance
▶ State machine replication
▶ Our Contributions
▶ The tolerance control architecture
▶ Constrained two-level control problem
▶ Theoretical results
▶ Computational algorithms
▶ Comparison with State-of-the-art
▶ Implementation and evaluation
▶ Conclusions
15/42
Two-Level Feedback Control for Intrusion Tolerance
. . .
π1(b1) π2(b2) π3(b3) π4(b4) πN(bN)
Belief
transmissions
Node controllers
Replicated
system
System controller
π(b1, . . . , bN)
b1 b2 b3 b4 bN
▶ Node controllers with strategies π1, . . . , πN compute belief
states b1, . . . , bN and make local recovery decisions; the belief
states are transmitted to a global system controller with
strategy π, which controls the replication factor
▶ Key insight: the control problems correspond to classical
problems studied in operations research, namely the machine
replacement problem and the inventory replenishment
problem, both of which have been studied for nearly a century.
16/42
The tolerance Control Architecture
tolerance: Two-level recovery and scaling control with
feedback.
tolerance
Node 1
Privileged domain
Application domain
Service
replica
Node controller
ids
alerts
reco-
very
Virtualization layer
Hardware
Node 2
Privileged domain
Application domain
Service
replica
Node controller
ids
alerts
reco-
very
Virtualization layer
Hardware
Node N
Privileged domain
Application domain
Service
replica
Node controller
ids
alerts
reco-
very
Virtualization layer
Hardware
. . .
Intrusion-tolerant consensus protocol
Client interface
System controller
State estimate Scaling action State estimate Scaling action State estimate Scaling action
. . .
Service requests Responses
Clients Attacker
Intrusion attempts
17/42
Correctness of tolerance (1/2)
Definition (Correct service)
We say that a system provides correct service if the healthy
replicas satisfy the following properties:
Each replica executes the same request sequence. (Safety)
Each request is eventually executed. (Liveness)
Each executed request was sent by a client. (Validity)
18/42
Correctness of tolerance (2/2)
Proposition
A system that implements the tolerance architecture provides
correct service provided that:
1. The controllers can only fail by crashing.
2. Network links are authenticated and reliable.
3. An attacker can not break cryptographic codes.
4. The system is partially synchronous.
5. At most k nodes recover simultaneously and at most f nodes
are compromised or crashed simultaneously.
6. Nt ≥ 2f + 1 + k at all times t.
Remark: tolerance does not ensure confidentiality as a
compromised node may leak information to the attacker. By
appropriate use of cryptography and firewalls, it is possible to
extend tolerance to provide confidentiality. Details omitted.
18/42
Correctness of tolerance (2/2)
Proposition
A system that implements the tolerance architecture provides
correct service provided that:
1. The controllers can only fail by crashing.
2. Network links are authenticated and reliable.
3. An attacker can not break cryptographic codes.
4. The system is partially synchronous.
5. At most k nodes recover simultaneously and at most f
nodes are compromised or crashed simultaneously.
6. Nt ≥ 2f + 1 + k at all times t.
Remark: tolerance does not ensure confidentiality as a
compromised node may leak information to the attacker. By
appropriate use of cryptography and firewalls, it is possible to
extend tolerance to provide confidentiality. Details omitted.
19/42
The Local Level: Intrusion Recovery Control (1/4)
▶ Nodes Nt ≜ {1, 2, . . . , Nt} and controllers
π1,t, π2,t, . . . , πN,t.
▶ Hidden states SN = {H, C, ∅} (see
figure).
▶ Actions: (W)ait and (R)ecover
▶ Observation oi,t ∼ Z represents the
number of ids alerts related to node i at
time t.
▶ A node controller computes
bi,t ≜ P[Si,t = C | oi,1, . . .] and makes
decisions ai,t = πi,t(bi,t) ∈ {W, R}.
H C
∅
Crashed
Healthy Compromised
pC1 pC2
pR or a = R
pA
20/42
The Local Level: Probability of Failure (2/4)
10 20 30 40 50 60 70 80 90 100
0.5
1
pA = 0.1 pA = 0.05 pA = 0.025
pA = 0.01 pA = 0.005
t
P[St = C ∪ St = ∅]
Probability of node compromise (St = C) or crash (St = ∅) in function of
time t, assuming no recoveries.
21/42
The Local Level: Intrusion Recovery Control (3/4)
Goals: minimize the average time-to-recovery T
(R)
i and minimize
the frequency of recoveries F
(R)
i :
minimize Ji ≜ lim
T→∞
h
ηT
(R)
i,T + F
(R)
i,T
i
(1)
= lim
T→∞
"
1
T
T
X
t=1
ηsi,t − ai,tηsi,t + ai,t
| {z }
≜cN(si,t ,ai,t )
#
We define intrusion recovery to be the problem of minimizing the
above objective subject to a bounded-time-to-recovery (btr)
safety constraint which ensures that the time between two
recoveries of a node is bounded to by ∆R, which can be configured
by the system administrator.
22/42
The Local Level: Intrusion Recovery Control (4/4)
Problem (Optimal Intrusion Recovery Control)
minimize
πi,t ∈ΠN
Eπi,t [Ji | Bi,1 = pA] ∀i ∈ N (2a)
subject to ai,k∆R
= R ∀i, k (2b)
si,t+1 ∼ fN(· | si,t, ai,t) ∀i, t (2c)
oi,t+1 ∼ Z(· | si,t) ∀i, t (2d)
ai,t+1 ∼ πi,t(bi,t) ∀i, t (2e)
ai,t ∈ AN, si,t ∈ SN, oit, ∈ O ∀i, t (2f)
23/42
Numerical Results for the Intrusion Recovery Problem
0.0 0.5 1.0
bi,1
2
3
4
V
?
i,1
(b
i,1
)
pA = 0.1
0.0 0.5 1.0
bi,1
2
4
V
?
i,1
(b
i,1
)
pA = 0.05
0.0 0.5 1.0
bi,1
1
2
3
V
?
i,1
(b
i,1
)
pA = 0.025
0.0 0.5 1.0
bi,1
2
4
V
?
i,1
(b
i,1
)
pA = 0.01
0.0 0.5 1.0
bi,1
0
2
4
V
?
i,1
(b
i,1
)
pA = 0.005
Illustration of the optimal value function V ⋆
i,t(bi,t) for the local control
problem. V ⋆
i,t was computed using the incremental pruning algorithm;
black lines indicate the value function; red lines indicate the
alpha-vectors.
23/42
Numerical Results for the Intrusion Recovery Problem
0.0 0.5 1.0
bi,1
2
3
4
V
?
i,1
(b
i,1
)
pA = 0.1
0.0 0.5 1.0
bi,1
2
4
V
?
i,1
(b
i,1
)
pA = 0.05
0.0 0.5 1.0
bi,1
1
2
3
V
?
i,1
(b
i,1
)
pA = 0.025
0.0 0.5 1.0
bi,1
2
4
V
?
i,1
(b
i,1
)
pA = 0.01
0.0 0.5 1.0
bi,1
0
2
4
V
?
i,1
(b
i,1
)
pA = 0.005
Illustration of the optimal value function V ⋆
i,t(bi,t) for the local control
problem. V ⋆
i,t was computed using the incremental pruning algorithm;
black lines indicate the value function; red lines indicate the
alpha-vectors.
Based on the above results we hypothesize that there exists an
optimal recovery strategy of the form:
0
1
1
0
ai,t = π⋆
i,t(b)
α⋆
t
b
(W)ait
(R)ecover
W R
24/42
Structure of an Optimal Intrusion Recovery Strategy (1/2)
Theorem (Optimal Threshold Recovery Strategies)
If the following holds
pA, pU, pC1 , pC2 ∈ (0, 1) (A)
pA + pU ≤ 1 (B)
pC1 (pU − 1)
pA(pC1 − 1) + pC1 (pU − 1)
≤ pC2 (C)
Z(oi,t | si,t) > 0 ∀oi,t, si,t (D)
Z is tp-2 (E)
then there exists an optimal recovery strategy π⋆
i,t for each node
i ∈ N that satisfies
π⋆
i,t(bi,t) = R ⇐⇒ bi,t ≥ α⋆
t ∀t, α⋆
t ∈ [0, 1] (3)
25/42
Structure of an Optimal Intrusion Recovery Strategy (2/2)
Corollary (Stationary Optimal Strategy as ∆R → ∞)
The recovery thresholds satisfy α⋆
t+1 ≥ α⋆
t for all
t ∈ [k∆R, (k + 1)∆R] and as ∆R → ∞, the thresholds converge to
a time-independent threshold α⋆.
20 40 60 80 100
0.85
0.9
0.95
α⋆
t
t
The thresholds were computed using the Incremental Pruning algorithm.
26/42
The Global Level: Controlling the Replication Factor (1/6)
. . .
π1(b1) π2(b2) π3(b3) π4(b4) πN(bN)
Belief
transmissions
Node controllers
Replicated
system
System controller
π(b1, . . . , bN)
b1 b2 b3 b4 bN
27/42
The Global Level: Controlling the Replication Factor (2/6)
0 1
smax
−1
smax
. . .
at = 1 at = 1
at = 0 at = 0 at = 0 at = 0
▶ At each time t, the system controller receives the belief
states b1,t, . . . , bN,t from the node controllers and decides if
the replication factor N should be increased.
▶ State st: estimated number of healthy nodes based on
b1,t, . . . , bN,t
▶ Actions: at ∈ {0, 1} ≜ AS, where at = 1 means that a new
node is added to the system and at = 0 is a passive action.
▶ smax is the maximum number of nodes (needed to define the
theoretical model). In practice smax may be very large.
28/42
The Global Level (3/6): Mean Time to Failure
10 20 30 40 50 60 70 80 90 100
200
400
600
pA = 0.1 pA = 0.05 pA = 0.025
pA = 0.01 pA = 0.005
N1
E[T(f )]
Mean time to failure (mttf) in function of the initial number of nodes
N1; T(f )
is a random variable representing the time when Nt < f + 1 with
f = 3 and k = 1; the curves relate to different intrusion probabilities pA.
29/42
The Global Level (4/6): Reliability Curve
10 20 30 40 50 60 70 80 90 100
0.2
0.4
0.6
0.8
1
N1 = 25 N1 = 50 N1 = 100
N1 = 200 N1 = 400
t
R(t)
Reliability curves for varying number of nodes N; The reliability function
is defined as R(t) ≜ P[T(f )
> t] where T(f )
is a random variable
representing the time when Nt < f + 1 with f = 3.
30/42
The Global Level (5/6): Controlling the Replication Factor
Goal: maximize the average service availability T(A) and
minimize the number of nodes. We model these two goals with the
following constrained objective
minimize J ≜ lim
T→∞
"
1
T
T
X
t=1
st
#
(4)
subject to T(A)
≥ ϵA
=⇒ lim
T→∞
"
1
T
T
X
t=1
JNt ≥ 2f + 1K
#
≥ ϵA
=⇒ lim
T→∞
"
1
T
T
X
t=1
Jst ≥ f + 1K
#
≥ ϵA
where ϵA is the minimum allowed average service availability with
respect to the tolerance threshold f .
31/42
The Global Level (6/6): Controlling the Replication Factor
Problem (Optimal Control of the Replication Factor)
minimize
π∈ΠS
Eπ [J | S1 = N] (5a)
subject to Eπ
h
T(A)
i
≥ ϵA ∀t (5b)
st+1 = fS(st, at, δt) ∈ SS ∀t (5c)
δt ∼ p∆(st) ∀t (5d)
at+1 ∼ πt(st) ∈ AS ∀t (5e)
32/42
Structure of an Optimal Scaling Strategy
Theorem
If the following holds
∃π ∈ ΠS such that Eπ
h
T(A)
i
≥ ϵA (A)
fS(s′
| s, a) > 0 ∀s′
, s, a (B)
smax
X
s′=s
fS(s′
| ŝ + 1, a) ≥
smax
X
s′=s
fS(s′
| ŝ, a) ∀s, ŝ, a (C)
then there exists two strategies πλ1 and πλ1 that satisfy
πλ1 (st) = 1 ⇐⇒ st ≤ β1 πλ2 (st) = 1 ⇐⇒ st ≤ β2 ∀t (6)
and an optimal randomized threshold strategy π⋆ that satisfies
π⋆
(st) = κπλ1 (st) + (1 − κ)πλ2 (st) ∀t (7)
for some probability κ ∈ [0, 1].
33/42
Efficient Algorithms for Computing the Optimal Control
Strategies
▶ The problem of computing an optimal scaling strategy has
polynomial time-complexity. This follows because the problem
can be formulated as a linear program of polynomial size.
▶ The problem of computing the optimal intrusion recovery
strategies is in the complexity class pspace-hard, and thus
no efficient (polynomial-time) algorithm for solving this
problem is known. (Note that p ⊆ np ⊆ pspace.)
▶ To manage the high computational complexity of
computing the optimal recovery strategies we leverage
Theorem 1 and Corollary 1.
33/42
Algorithm for The Local Control Problem
Algorithm 1: Recovery Threshold Optimization (rto)
1 Input: η, pA, pC1 , pC2 , pU, Z, ∆R
2 Parametric optimization algorithm: PO
3 Output: A near-optimal local control strategy π̂θ,t
4 Algorithm
5 d ← 1 − ∆R if ∆R < ∞ else d ← 1
6 Θ ← [0, 1]d
7 For each θ ∈ Θ, define πi,θ(bt) as
πi,θ(bt) ≜
(
R if bt ≥ θi where i = max[t, d]
W otherwise
8 Jθ ← Eπi,θ
[Ji ]
9 π̂θ,t ← PO(Θ, Jθ)
10 return π̂θ,t
33/42
Algorithm for The Global Control Problem
Algorithm 2: Linear Program for Scaling (lp-r)
1 Input: smax, ϵA, N, f , p∆
2 Linear programming solver: LPSolver
3 Output: An optimal global control strategy π⋆
4 Algorithm
5 Solve the following linear program with LPSolver
minimize
ρ
X
s∈SS
X
a∈AS
sρ(s, a) (8a)
subject to (8b)
ρ(s, a) ≥ 0 ∀s ∈ SS, a ∈ AS (8c)
X
s∈SS
X
a∈AS
ρ(s, a) = 1 (8d)
X
a∈AS
ρ(s, a) =
X
s′∈SS
X
a∈AS
ρ(s′
, a)fS(s′
|s, a) ∀s ∈ SS (8e)
X
s∈SS
X
a∈AS
ρ(s, a)Jst ≥ f + 1K ≥ ϵA (8f)
6 Let ρ⋆ denote the solution to the above program and define
π⋆ as
π⋆
(a|s) ≜
ρ⋆(s, a)
P
s∈SS
ρ⋆(s, a)
∀s ∈ SS, a ∈ AS
return π⋆
34/42
Evaluation of the rto Algorithm (1/2)
0.0 0.2
0.09
0.10
0.11
Avg
cost
J
i
∆R = 3
0.0 2.5
0.12
0.14
∆R = 4
0 10
0.12
0.15
0.18
∆R = 5
0 5
0.15
0.20
∆R = 6
0 10
0.15
0.20
0.25
∆R = 7
0 10
0.20
0.25
∆R = 8
0 10
0.20
0.25
∆R = 9
0 10
0.20
0.25
∆R = 10
0 10
0.20
0.25
Avg
cost
J
i
∆R = 11
0 10
0.20
0.25
∆R = 12
0 10
0.20
0.25
∆R = 13
0 20
0.20
0.25
∆R = 14
0 20
0.20
0.25
∆R = 15
0 20
0.20
0.25
∆R = 16
0 10
0.20
0.25
∆R = 17
0 10
0.20
0.25
∆R = 18
0 10
Time (min)
0.20
0.25
Avg
cost
J
i
∆R = 19
0 20
Time (min)
0.20
0.25
∆R = 20
0 20
Time (min)
0.20
0.25
0.30
∆R = 21
0 20
Time (min)
0.20
0.25
0.30
∆R = 22
0 20
Time (min)
0.20
0.25
0.30
∆R = 23
0 20
Time (min)
0.20
0.25
0.30
∆R = 24
0 20
Time (min)
0.20
0.25
0.30
∆R = 25
0 20
Time (min)
0.20
0.40
∆R = ∞
optimal cem de bo spsa ppo lower bound
mean values from evaluations with 20 different random seeds; ± indicate
the 95% confidence interval based on the Student’s t-distribution.
Method
∆R = 5 ∆R = 15 ∆R = 25 ∆R = ∞
Time (min) Ji Time (min) Ji Time (min) Ji Time (min) Ji
cem 1.04 0.12 ± 0.01 8.84 0.17 ± 0.06 14.48 0.19 ± 0.08 11.81 0.16 ± 0.01
de 2.35 0.12 ± 0.03 8.98 0.17 ± 0.01 15.45 0.18 ± 0.02 22.68 0.16 ± 0.01
spsa 10.78 0.18 ± 0.01 88.35 0.58 ± 0.40 123.85 0.77 ± 0.48 4.20 0.20 ± 0.02
bo 29.18 0.12 ± 0.02 62.57 0.17 ± 0.05 90.26 0.18 ± 0.12 9.07 0.15 ± 0.06
ppo 28.20 0.18 ± 0.01 30.01 0.19 ± 0.02 30.33 0.21 ± 0.07 28.95 0.21 + ±0.09
ip 11.11 0.12 237.06 0.17 743.73 0.18 > 10000 not converged
35/42
Evaluation of the rto Algorithm (2/2)
5 10 15 20 25 30
500
1,000 cem
de
bo
spsa
ip
∆R
Convergence
time
(min)
Time required to compute optimal intrusion recovery strategies; the
x-axis indicate different values of ∆R; the error bars indicate the 95%
confidence interval based on the Student’s t-distribution with 20
measurements.
36/42
Evaluation of the lp-r Algorithm
2 4 8 16 32 64 128 256 512 1024 2048 4096
0
100
200
Time
to
solve
(s)
smax
Time required to compute optimal scaling strategies; the x-axis indicate
different values of smax; the error bars indicate the 95% confidence
interval based on the Student’s t-distribution with 20 measurements.
37/42
Outline
▶ Use Case & Research Problem
▶ Use case: intrusion tolerance
▶ Goal: optimal control strategies for intrusion-tolerant systems
▶ Background
▶ Fault tolerance and intrusion tolerance
▶ State machine replication
▶ Our Contributions
▶ The tolerance control architecture
▶ Constrained two-level control problem
▶ Theoretical results
▶ Computational algorithms
▶ Comparison with State-of-the-art
▶ Implementation and evaluation
▶ Conclusions
38/42
We Instantiate The tolerance Control Architecture
with The Computed Control Strategies
tolerance
Node 1
Privileged domain
Application domain
Service
replica
Node controller
ids
alerts
reco-
very
Virtualization layer
Hardware
Node 2
Privileged domain
Application domain
Service
replica
Node controller
ids
alerts
reco-
very
Virtualization layer
Hardware
Node N
Privileged domain
Application domain
Service
replica
Node controller
ids
alerts
reco-
very
Virtualization layer
Hardware
. . .
Intrusion-tolerant consensus protocol
Client interface
System controller
State estimate Scaling action State estimate Scaling action State estimate Scaling action
. . .
Service requests Responses
Clients Attacker
Intrusion attempts
39/42
Experiment Setup - Physical Servers
Server Processors ram (gb)
1, r715 2u 2 12-core amd opteron 64
2, r715 2u 2 12-core amd opteron 64
3, r715 2u 2 12-core amd opteron 64
4, r715 2u 2 12-core amd opteron 64
5, r715 2u 2 12-core amd opteron 64
6, r715 2u 2 12-core amd opteron 64
7, r715 2u 2 12-core amd opteron 64
8, r715 2u 2 12-core amd opteron 64
9, r715 2u 2 12-core amd opteron 64
10, r630 2u 2 12-core intel xeon e5-2680 256
11, r740 2u 1 20-core intel xeon gold5218r 32
12, supermicro 7049 2 tesla p100, 1 16-core intel xeon 126
13, supermicro 7049 4 rtx 8000, 1 24-core intel xeon 768
Table 1: Specifications of the physical servers.
39/42
Experiment Setup - Replica Configurations
Replica ID Operating system Vulnerabilities
1 ubuntu 14 ftp weak password
2 ubuntu 20 ssh weak password
3 ubuntu 20 telnet weak password
4 debian 10.2 cve-2017-7494
5 ubuntu 20 cve-2014-6271
6 debian 10.2 cve-89 on cve
7 debian 10.2 cve-2015-3306
8 debian 10.2 cve-2016-10033
9 debian 10.2 cve-2010-0426, ssh weak password
10 debian 10.2 cve-2015-5602, ssh weak password
Table 2: Replica configurations.
39/42
Experiment Setup - Emulated Intrusions
Replica ID Intrusion steps
1 tcp syn scan, ftp brute force
2 tcp syn scan, ssh brute force
3 tcp syn scan, telnet brute force
4 icmp scan, exploit of cve-2017-7494
5 icmp scan, exploit of cve-2014-6271
6 icmp scan, exploit of cve-89 on on cve
7 icmp scan, exploit of cve-2015-3306
8 icmp scan, exploit of cve-2016-10033
9 icmp scan, ssh brute force, exploit of cve-2010-0426
10 icmp scan, ssh brute force, exploit of cve-2015-5602
Table 3: Intrusion steps
39/42
Experiment Setup - Background Traffic
Background services Replica ID(s)
ftp, ssh, mongodb, http, teamspeak 1
ssh, dns, http 2
ssh, telnet, http 3
ssh, samba, ntp 4
ssh 5, 7, 8, 10
cve, irc, ssh 6
teamspeak, http, ssh 9
Table 4: Background services; each background client invokes functions
on service replicas.
39/42
Experiment Setup - Consensus Algorithm
We implement and extend the minbft Byzantine fault-tolerant
consensus algorithm to be reconfigurable.
Client
Replica 1
(leader)
Replica 2
Replica 3
request prepare commit reply
Replica 1
(leader v)
(leader v + 1)
Replica 2
Replica 3
crash
request
view-change
view-change new-view
Replica 1
Replica 2
Replica 3
checkpoint
Controller
Replica 1
(compromised)
Replica 2
Replica 3
recover
request
state
state
New replica
Replica 1
(leader v)
(leader v + 1)
Replica 2
Replica 3
join-request join new-view join-reply
System
controller
Replica 1
(leader v)
(leader v + 1)
Replica 2
Replica 3
evict-request evict new-view exit-reply
a) Normal operation b) View change
c) Checkpoint
d) State transfer
e) Join f) Evict
39/42
Experiment Setup - Consensus Algorithm
Throughput of our implementation of minbft.
3 4 5 6 7 8 9 10
20
40
60
1 client 20 clients
Number of nodes (N)
Avg
throughput
(req/s)
40/42
System Identification
0 2000 4000 6000 8000
probability
cve-2010-0426
0 2000 4000 6000 8000
cve-2015-3306
0 2000 4000 6000 8000
cve-2015-5602
0 2000 4000 6000 8000
cve-2016-10033
0 2000 4000 6000 8000
cwe-89
0 2000 4000 6000 8000
O
probability
cve-2017-7494
0 2000 4000 6000 8000
O
cve-2014-6271
0 5000 10000 15000 20000
O
ftp brute force
0 5000 10000 15000 20000
O
ssh brute force
0 5000 10000 15000 20000
O
telnet brute force
intrusion no intrusion intrusion model normal operation model
Empirical observation distributions b
Z1(· | s), . . . , b
Z10(· | s) as estimates of
Z1, . . . , Z10.
▶ Empirical distributions based on M = 25, 000 samples.
▶ From the Glivenko-Cantelli theorem we know that b
Z →a.s Z
as M → ∞.
▶ Bound:
P
h
Dkl(b
Z(· | s) ∥ Z(· | s)) ≥ ϵ
i
≤ 2−M ϵ−|O|
ln(M+1)
M

= 2
−25·103

ϵ−2·103 ln(25·103+1)
25·103

= 2−5·103
(5ϵ−4 ln(25·103+1))
41/42
Comparison with State-of-the-art Intrusion-Tolerant
Systems
0
0.5
1
5 15 25 ∞
101
102
5 15 25 ∞ 0
0.1
0.2
5 15 25 ∞ 0
10
20
30
40
5 15 25 ∞
0
0.5
1
5 15 25 ∞
101
102
5 15 25 ∞ 0
0.1
0.2
5 15 25 ∞ 0
20
40
60
5 15 25 ∞
0
0.5
1
5 15 25 ∞
101
102
no-recovery periodic tolerance periodic-adaptive
5 15 25 ∞ 0
0.1
0.2
5 15 25 ∞ 0
50
100
5 15 25 ∞
Maximum time-to-recovery ∆R Maximum time-to-recovery ∆R Maximum time-to-recovery ∆R Maximum time-to-recovery ∆R
Average availability T(A) Average time-to-recovery T(R) Average recovery frequency F(R) Average cost Ji + J
N
1
=
3
N
1
=
6
N
1
=
9
Comparison between tolerance and the baselines; the columns
represent: average availability (T(A)
), average time-to-recovery (T(R)
);
average recovery frequency (F(R)
); and average cost Ji + J; the bars
indicate the mean value from evaluations with 20 different random seeds;
the error bars indicate the 95% confidence interval based on the
Student’s t-distribution.
42/42
Conclusions
▶ We present tolerance: a novel
control architecture for
intrusion-tolerant systems which
improves state-of-the-art.
▶ We prove that the optimal control
strategies have threshold
structures and design efficient
algorithms for computing them.
▶ We evaluate tolerance in an
emulation environment against 10
different types of network
intrusions.
. . .
π1(b1) π2(b2) π3(b3) π4(b4) πN(bN)
Belief
transmissions
Node controllers
Replicated
system
System controller
π(b1, . . . , bN)
b1 b2 b3 b4 bN

More Related Content

Similar to Intrusion Tolerance Through Two-Level Feedback Control

Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionKim Hammar
 
2011-05-02 - VU Amsterdam - Testing safety critical systems
2011-05-02 - VU Amsterdam - Testing safety critical systems2011-05-02 - VU Amsterdam - Testing safety critical systems
2011-05-02 - VU Amsterdam - Testing safety critical systemsJaap van Ekris
 
Cost-effective software reliability through autonomic tuning of system resources
Cost-effective software reliability through autonomic tuning of system resourcesCost-effective software reliability through autonomic tuning of system resources
Cost-effective software reliability through autonomic tuning of system resourcesVincenzo De Florio
 
Effectiveness Measurement Framework for Field-Based Experiments Focused on An...
Effectiveness Measurement Framework for Field-Based Experiments Focused on An...Effectiveness Measurement Framework for Field-Based Experiments Focused on An...
Effectiveness Measurement Framework for Field-Based Experiments Focused on An...Ivan Pretel
 
Discussion Shared Practice The Triple Bottom LineIs the o.docx
Discussion Shared Practice The Triple Bottom LineIs the o.docxDiscussion Shared Practice The Triple Bottom LineIs the o.docx
Discussion Shared Practice The Triple Bottom LineIs the o.docxelinoraudley582231
 
2010-03-31 - VU Amsterdam - Experiences testing safety critical systems
2010-03-31 - VU Amsterdam - Experiences testing safety critical systems2010-03-31 - VU Amsterdam - Experiences testing safety critical systems
2010-03-31 - VU Amsterdam - Experiences testing safety critical systemsJaap van Ekris
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Selecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategiesSelecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategiesMicah Altman
 
Agile performance engineering with cloud 2016
Agile performance engineering with cloud   2016Agile performance engineering with cloud   2016
Agile performance engineering with cloud 2016Ken Chan
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion ResponseKim Hammar
 
Introduction to Distributed Systems
Introduction to Distributed SystemsIntroduction to Distributed Systems
Introduction to Distributed Systemsssuser097ea8
 
Recognize, assess, reduce, and manage technical debt
Recognize, assess, reduce, and manage technical debtRecognize, assess, reduce, and manage technical debt
Recognize, assess, reduce, and manage technical debtJim Bethancourt
 
Iet Prestige Lecture Coping With Complexity 7th December
Iet Prestige Lecture Coping With Complexity 7th DecemberIet Prestige Lecture Coping With Complexity 7th December
Iet Prestige Lecture Coping With Complexity 7th DecemberFrancis_McKinney
 
Security Requirement Specification Model for Cloud Computing Services
Security Requirement Specification Model for Cloud Computing ServicesSecurity Requirement Specification Model for Cloud Computing Services
Security Requirement Specification Model for Cloud Computing ServicesMatteo Leonetti
 
Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems​
Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems​Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems​
Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems​Mohammad Jafar Mashhadi
 
Safety Enhancement for Deep Reinforcement Learning in Autonomous Separation A...
Safety Enhancement for Deep Reinforcement Learning in Autonomous Separation A...Safety Enhancement for Deep Reinforcement Learning in Autonomous Separation A...
Safety Enhancement for Deep Reinforcement Learning in Autonomous Separation A...Wei Guo
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Kim Hammar
 
An Empirical Study On Practicality Of Specification Mining Algorithms On A Re...
An Empirical Study On Practicality Of Specification Mining Algorithms On A Re...An Empirical Study On Practicality Of Specification Mining Algorithms On A Re...
An Empirical Study On Practicality Of Specification Mining Algorithms On A Re...Mohammad Jafar Mashhadi
 
PERICLES Ecosystem Modelling (NCDD use case) - Acting on Change 2016
PERICLES Ecosystem Modelling (NCDD use case) - Acting on Change 2016PERICLES Ecosystem Modelling (NCDD use case) - Acting on Change 2016
PERICLES Ecosystem Modelling (NCDD use case) - Acting on Change 2016PERICLES_FP7
 

Similar to Intrusion Tolerance Through Two-Level Feedback Control (20)

Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via Decomposition
 
2011-05-02 - VU Amsterdam - Testing safety critical systems
2011-05-02 - VU Amsterdam - Testing safety critical systems2011-05-02 - VU Amsterdam - Testing safety critical systems
2011-05-02 - VU Amsterdam - Testing safety critical systems
 
Cost-effective software reliability through autonomic tuning of system resources
Cost-effective software reliability through autonomic tuning of system resourcesCost-effective software reliability through autonomic tuning of system resources
Cost-effective software reliability through autonomic tuning of system resources
 
Effectiveness Measurement Framework for Field-Based Experiments Focused on An...
Effectiveness Measurement Framework for Field-Based Experiments Focused on An...Effectiveness Measurement Framework for Field-Based Experiments Focused on An...
Effectiveness Measurement Framework for Field-Based Experiments Focused on An...
 
Discussion Shared Practice The Triple Bottom LineIs the o.docx
Discussion Shared Practice The Triple Bottom LineIs the o.docxDiscussion Shared Practice The Triple Bottom LineIs the o.docx
Discussion Shared Practice The Triple Bottom LineIs the o.docx
 
2010-03-31 - VU Amsterdam - Experiences testing safety critical systems
2010-03-31 - VU Amsterdam - Experiences testing safety critical systems2010-03-31 - VU Amsterdam - Experiences testing safety critical systems
2010-03-31 - VU Amsterdam - Experiences testing safety critical systems
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Selecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategiesSelecting efficient and reliable preservation strategies
Selecting efficient and reliable preservation strategies
 
Agile performance engineering with cloud 2016
Agile performance engineering with cloud   2016Agile performance engineering with cloud   2016
Agile performance engineering with cloud 2016
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion Response
 
intro_to_dis.pdf
intro_to_dis.pdfintro_to_dis.pdf
intro_to_dis.pdf
 
Introduction to Distributed Systems
Introduction to Distributed SystemsIntroduction to Distributed Systems
Introduction to Distributed Systems
 
Recognize, assess, reduce, and manage technical debt
Recognize, assess, reduce, and manage technical debtRecognize, assess, reduce, and manage technical debt
Recognize, assess, reduce, and manage technical debt
 
Iet Prestige Lecture Coping With Complexity 7th December
Iet Prestige Lecture Coping With Complexity 7th DecemberIet Prestige Lecture Coping With Complexity 7th December
Iet Prestige Lecture Coping With Complexity 7th December
 
Security Requirement Specification Model for Cloud Computing Services
Security Requirement Specification Model for Cloud Computing ServicesSecurity Requirement Specification Model for Cloud Computing Services
Security Requirement Specification Model for Cloud Computing Services
 
Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems​
Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems​Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems​
Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems​
 
Safety Enhancement for Deep Reinforcement Learning in Autonomous Separation A...
Safety Enhancement for Deep Reinforcement Learning in Autonomous Separation A...Safety Enhancement for Deep Reinforcement Learning in Autonomous Separation A...
Safety Enhancement for Deep Reinforcement Learning in Autonomous Separation A...
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
 
An Empirical Study On Practicality Of Specification Mining Algorithms On A Re...
An Empirical Study On Practicality Of Specification Mining Algorithms On A Re...An Empirical Study On Practicality Of Specification Mining Algorithms On A Re...
An Empirical Study On Practicality Of Specification Mining Algorithms On A Re...
 
PERICLES Ecosystem Modelling (NCDD use case) - Acting on Change 2016
PERICLES Ecosystem Modelling (NCDD use case) - Acting on Change 2016PERICLES Ecosystem Modelling (NCDD use case) - Acting on Change 2016
PERICLES Ecosystem Modelling (NCDD use case) - Acting on Change 2016
 

More from Kim Hammar

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesKim Hammar
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHKim Hammar
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Kim Hammar
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security AutomationKim Hammar
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Kim Hammar
 
Intrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingIntrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingKim Hammar
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...Kim Hammar
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseKim Hammar
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Kim Hammar
 
Learning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingLearning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingKim Hammar
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingKim Hammar
 
Intrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayIntrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayKim Hammar
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Kim Hammar
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Kim Hammar
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingKim Hammar
 
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...Kim Hammar
 
Reinforcement Learning Algorithms for Adaptive Cyber Defense against Heartbleed
Reinforcement Learning Algorithms for Adaptive Cyber Defense against HeartbleedReinforcement Learning Algorithms for Adaptive Cyber Defense against Heartbleed
Reinforcement Learning Algorithms for Adaptive Cyber Defense against HeartbleedKim Hammar
 
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021Kim Hammar
 
Självlärande system för cybersäkerhet
Självlärande system för cybersäkerhetSjälvlärande system för cybersäkerhet
Självlärande system för cybersäkerhetKim Hammar
 
Learning Intrusion Prevention Policies Through Optimal Stopping
Learning Intrusion Prevention Policies Through Optimal StoppingLearning Intrusion Prevention Policies Through Optimal Stopping
Learning Intrusion Prevention Policies Through Optimal StoppingKim Hammar
 

More from Kim Hammar (20)

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTH
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security Automation
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.
 
Intrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingIntrusion Response through Optimal Stopping
Intrusion Response through Optimal Stopping
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber Defense
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.
 
Learning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingLearning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal Stopping
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal Stopping
 
Intrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayIntrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-Play
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal Stopping
 
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
A Game Theoretic Analysis of Intrusion Detection in Access Control Systems - ...
 
Reinforcement Learning Algorithms for Adaptive Cyber Defense against Heartbleed
Reinforcement Learning Algorithms for Adaptive Cyber Defense against HeartbleedReinforcement Learning Algorithms for Adaptive Cyber Defense against Heartbleed
Reinforcement Learning Algorithms for Adaptive Cyber Defense against Heartbleed
 
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021
Learning Intrusion Prevention Policies through Optimal Stopping - CNSM2021
 
Självlärande system för cybersäkerhet
Självlärande system för cybersäkerhetSjälvlärande system för cybersäkerhet
Självlärande system för cybersäkerhet
 
Learning Intrusion Prevention Policies Through Optimal Stopping
Learning Intrusion Prevention Policies Through Optimal StoppingLearning Intrusion Prevention Policies Through Optimal Stopping
Learning Intrusion Prevention Policies Through Optimal Stopping
 

Recently uploaded

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Recently uploaded (20)

FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Intrusion Tolerance Through Two-Level Feedback Control

  • 1. 1/42 Intrusion Tolerance for Networked Systems Through Two-Level Feedback Control NSE Seminar Kim Hammar kimham@kth.se Division of Network and Systems Engineering KTH Royal Institute of Technology Nov 10, 2023
  • 2. 2/42 Use Case: Intrusion Tolerance . . . Clients api gateways Compute nodes Storage nodes Service replica 1 Service replica 2 Service replica 3 Service replica 4 Client interface & load balancer ▶ A replicated system offers services to a client population. ▶ The system must be highly available and provide correct service without disruption.
  • 3. 2/42 Use Case: Intrusion Tolerance . . . Attacker Clients api gateways Compute nodes Storage nodes Service replica 1 Service replica 2 Service replica 3 Service replica 4 Client interface & load balancer ▶ An attacker seeks to intrude on the system and disrupt service ▶ The system should tolerate intrusions ▶ it should provide correct service even if a fraction of replicas are compromised
  • 4. 2/42 Examples of Systems that Need to Tolerate Intrusions . . . Control planes Embedded systems SCADA systems Payment systems
  • 5. 3/42 Intrusion Tolerance (Simplified) Intrusion event Time of full recovery Time Recovery time Survivability Loss Normal performance System performance Tolerance Cumulative performance loss (want to minimize)
  • 6. 3/42 Intrusion-Tolerant Systems - State of The Art ▶ State-of-the-art intrusion-tolerant systems involve 3 building blocks: 1. a protocol for service replication 2. a scaling strategy 3. a recovery strategy ▶ Given N replicas, the system provides correct service with up to f = N−1 3 compromised replicas. ▶ Theoretical upper bound ▶ f is the tolerance threshold. ▶ Simple control strategies: ▶ Fixed number of replicas (no scaling) ▶ No recovery or periodic recovery . . . Clients Service replica 1 Service replica 2 Service replica 3 Service replica 4 Client interface & load balancer
  • 7. 3/42 Intrusion-Tolerant Systems - State of The Art ▶ State-of-the-art intrusion-tolerant systems involve 3 building blocks: 1. a protocol for service replication 2. a scaling strategy 3. a recovery strategy ▶ Given N replicas, the system provides correct service with up to f = N−1 3 compromised replicas. ▶ Theoretical upper bound ▶ f is the tolerance threshold. ▶ Simple control strategies: ▶ Fixed number of replicas (no scaling) ▶ No recovery or periodic recovery . . . Clients Service replica 1 Service replica 2 Service replica 3 Service replica 4 Client interface & load balancer This work: Optimal intrusion recovery and scaling strategies
  • 8. 4/42 Can we use decision theory and learning-based methods to automatically find effective security strategies? Intrusion prevention Simulation. Small-scale. (2020)1. Intrusion prevention Optimal stopping. Emulation, small-scale Static attacker. (2021)2. Intrusion response Optimal multiple stopping. Emulation, small-scale. Static attacker. (2022)3 Intrusion response Dynkin game. Emulation, small-scale. Dynamic attacker. (2022)4 Intrusion response Decomposition. Emulation, large-scale Dynamic attacker. (2023)5 Intrusion tolerance Two-level control. Integration with bft. Static attacker. (This work) 1 Kim Hammar and Rolf Stadler. “Finding Effective Security Strategies through Reinforcement Learning and Self-Play”. In: International Conference on Network and Service Management (CNSM 2020). Izmir, Turkey, 2020. 2 Kim Hammar and Rolf Stadler. “Learning Intrusion Prevention Policies through Optimal Stopping”. In: International Conference on Network and Service Management (CNSM 2021). Izmir, Turkey, 2021. 3 Kim Hammar and Rolf Stadler. “Intrusion Prevention Through Optimal Stopping”. In: IEEE Transactions on Network and Service Management 19.3 (2022), pp. 2333–2348. doi: 10.1109/TNSM.2022.3176781. 4 Kim Hammar and Rolf Stadler. “Learning Near-Optimal Intrusion Responses Against Dynamic Attackers”. In: IEEE Transactions on Network and Service Management (2023), pp. 1–1. doi: 10.1109/TNSM.2023.3293413. 5 Kim Hammar and Rolf Stadler. “Scalable Learning of Intrusion Response through Recursive Decomposition”. In: 14th International Conference on Decision and Game Theory for Security. Avignon, France, 2023.
  • 9. 5/42 Our Framework for Automated Security s1,1 s1,2 s1,3 . . . s1,n s2,1 s2,2 s2,3 . . . s2,n . . . . . . . . . . . . . . . Emulation System Target Infrastructure Model Creation & System Identification Strategy Mapping π Selective Replication Strategy Implementation π Simulation System Learning & Optimization Strategy evaluation & Model estimation Automation & Self-learning systems ▶ Source code: https://github.com/Limmen/csle ▶ Documentation: http://limmen.dev/csle/ ▶ Demo: https://www.youtube.com/watch?v=iE2KPmtIs2A&
  • 10. 6/42 Outline ▶ Use Case & Research Problem ▶ Use case: intrusion tolerance ▶ Goal: optimal control strategies for intrusion-tolerant systems ▶ Background ▶ Fault tolerance and intrusion tolerance ▶ State machine replication ▶ Our Contributions ▶ The tolerance control architecture ▶ Constrained two-level control problem ▶ Theoretical results ▶ Computational algorithms ▶ Comparison with State-of-the-art ▶ Implementation and evaluation ▶ Conclusions
  • 11. 6/42 Outline ▶ Use Case & Research Problem ▶ Use case: intrusion tolerance ▶ Goal: optimal control strategies for intrusion-tolerant systems ▶ Background ▶ Fault tolerance and intrusion tolerance ▶ State machine replication ▶ Our Contributions ▶ The tolerance control architecture ▶ Constrained two-level control problem ▶ Theoretical results ▶ Computational algorithms ▶ Comparison with State-of-the-art ▶ Implementation and evaluation ▶ Conclusions
  • 12. 6/42 Outline ▶ Use Case & Research Problem ▶ Use case: intrusion tolerance ▶ Goal: optimal control strategies for intrusion-tolerant systems ▶ Background ▶ Fault tolerance and intrusion tolerance ▶ State machine replication ▶ Our Contributions ▶ The tolerance control architecture ▶ Constrained two-level control problem ▶ Theoretical results ▶ Computational algorithms ▶ Comparison with State-of-the-art ▶ Implementation and evaluation ▶ Conclusions
  • 13. 6/42 Outline ▶ Use Case & Research Problem ▶ Use case: intrusion tolerance ▶ Goal: optimal control strategies for intrusion-tolerant systems ▶ Background ▶ Fault tolerance and intrusion tolerance ▶ State machine replication ▶ Our Contributions ▶ The tolerance control architecture ▶ Constrained two-level control problem ▶ Theoretical results ▶ Computational algorithms ▶ Comparison with State-of-the-art ▶ Implementation and evaluation ▶ Conclusions
  • 14. 6/42 Outline ▶ Use Case & Research Problem ▶ Use case: intrusion tolerance ▶ Goal: optimal control strategies for intrusion-tolerant systems ▶ Background ▶ Fault tolerance and intrusion tolerance ▶ State machine replication ▶ Our Contributions ▶ The tolerance control architecture ▶ Constrained two-level control problem ▶ Theoretical results ▶ Computational algorithms ▶ Comparison with State-of-the-art ▶ Implementation and evaluation ▶ Conclusions
  • 15. 7/42 Background: Fault Tolerance and Intrusion Tolerance ▶ Seminal work made by von Neumann and Shannon in 1956. ▶ Initially focused on building fault-tolerant circuits. ▶ Fault tolerance includes tolerance against: software bugs, malicious attacks, operator mistakes, etc. ▶ Key strategy for fault tolerance: redundancy ▶ Redundancy is achieved through service replication. Replicas are coordinated through a consensus protocol. . . . Replicated system Client interface Request Response Consensus protocol 1 2 3 4 5
  • 16. 8/42 Background: Consensus ▶ Consensus is the problem of reaching agreement on a single value among a set of distributed nodes subject to failures. ▶ Fascinating problem for many reasons: ▶ Key problem to build practical systems ▶ The problem comes in many flavors ▶ Paradoxical cases ▶ Rich theory ▶ Turing awardees working on consensus: Dijkstra, 1972 Gray, 1998 Liskov, 2008 Lamport, 2013
  • 17. 9/42 Background: Consensus Definition (Consensus) We have N nodes indexed by 1, . . . , N. The nodes are connected by a complete graph and communicate via message passing. Each node starts with an input value v ∈ V. The goal is to agree on a single value in V. An algorithm A solves consensus if the following hold: 1. Agreement: No two correct nodes decide on different values 2. Termination: All nodes eventually decide. 3. Validity: If all nodes start with input v then they decide on v
  • 18. 10/42 Consensus Example: The Two Generals Problem (Gray ’78) ▶ Two generals are planning a coordinated attack from different directions. ▶ One of the simplest consensus problems: ▶ Only two nodes: 1 and 2 ▶ No process failures but link failures may occur ▶ Is the problem solvable?
  • 19. 10/42 Consensus Example: The Two Generals Problem (Gray ’78) ▶ Two generals are planning a coordinated attack from different directions. ▶ One of the simplest consensus problems: ▶ Only two nodes: 1 and 2 ▶ No process failures but link failures may occur ▶ Is the problem solvable? No! Can be proven by contradiction.
  • 20. 11/42 When Is Consensus Solvable? ▶ Solvability depends on synchrony and failure assumptions. ▶ Three synchronicity models: ▶ The asynchronous model: no bounds on either delays or clock drifts. ▶ The partially synchronous model: an upper bound exists but the system may have periods of instability where the upper bound does not hold. ▶ The synchronous model: there is an upper bound on the communication delay and clock drift between any two nodes. . . . Synchronous system Asynchronous system Partially synchronous system Internet Organization WAN
  • 21. 12/42 When Is Consensus Solvable? ▶ Three main failure models: ▶ Crash-stop: nodes fail by crashing. ▶ Byzantine: failed nodes may behave arbitrarily (e.g., be controlled by an attacker) ▶ Hybrid: Byzantine failures but each node is equipped with a trusted component that only fails by crashing. Byzantine failure Crash-stop failure Virtual machine Hybrid failure
  • 22. 13/42 When Is Consensus Solvable? Theorem (Summary of 40 years of research) ▶ Consensus is not solvable in the asynchronous model ▶ Consensus is solvable with a reliable network in the partially synchronous model with N nodes and up to ▶ f = N−1 2 Crash-stop failures ▶ f = N−1 3 Byzantine failures ▶ f = N−1 2 Hybrid failures (assuming authenticated channels) ▶ Consensus is solvable with a reliable network in the synchronous model with N nodes and up to ▶ f = N − 1 Crash-stop failures ▶ f = N−1 2 Byzantine failures (assuming authenticated channels) ▶ f = N−1 2 Hybrid failures (assuming authenticated channels)
  • 23. 14/42 Outline ▶ Use Case & Research Problem ▶ Use case: intrusion tolerance ▶ Goal: optimal control strategies for intrusion-tolerant systems ▶ Background ▶ Fault tolerance and intrusion tolerance ▶ State machine replication ▶ Our Contributions ▶ The tolerance control architecture ▶ Constrained two-level control problem ▶ Theoretical results ▶ Computational algorithms ▶ Comparison with State-of-the-art ▶ Implementation and evaluation ▶ Conclusions
  • 24. 15/42 Two-Level Feedback Control for Intrusion Tolerance . . . π1(b1) π2(b2) π3(b3) π4(b4) πN(bN) Belief transmissions Node controllers Replicated system System controller π(b1, . . . , bN) b1 b2 b3 b4 bN ▶ Node controllers with strategies π1, . . . , πN compute belief states b1, . . . , bN and make local recovery decisions; the belief states are transmitted to a global system controller with strategy π, which controls the replication factor ▶ Key insight: the control problems correspond to classical problems studied in operations research, namely the machine replacement problem and the inventory replenishment problem, both of which have been studied for nearly a century.
  • 25. 16/42 The tolerance Control Architecture tolerance: Two-level recovery and scaling control with feedback. tolerance Node 1 Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware Node 2 Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware Node N Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware . . . Intrusion-tolerant consensus protocol Client interface System controller State estimate Scaling action State estimate Scaling action State estimate Scaling action . . . Service requests Responses Clients Attacker Intrusion attempts
  • 26. 17/42 Correctness of tolerance (1/2) Definition (Correct service) We say that a system provides correct service if the healthy replicas satisfy the following properties: Each replica executes the same request sequence. (Safety) Each request is eventually executed. (Liveness) Each executed request was sent by a client. (Validity)
  • 27. 18/42 Correctness of tolerance (2/2) Proposition A system that implements the tolerance architecture provides correct service provided that: 1. The controllers can only fail by crashing. 2. Network links are authenticated and reliable. 3. An attacker can not break cryptographic codes. 4. The system is partially synchronous. 5. At most k nodes recover simultaneously and at most f nodes are compromised or crashed simultaneously. 6. Nt ≥ 2f + 1 + k at all times t. Remark: tolerance does not ensure confidentiality as a compromised node may leak information to the attacker. By appropriate use of cryptography and firewalls, it is possible to extend tolerance to provide confidentiality. Details omitted.
  • 28. 18/42 Correctness of tolerance (2/2) Proposition A system that implements the tolerance architecture provides correct service provided that: 1. The controllers can only fail by crashing. 2. Network links are authenticated and reliable. 3. An attacker can not break cryptographic codes. 4. The system is partially synchronous. 5. At most k nodes recover simultaneously and at most f nodes are compromised or crashed simultaneously. 6. Nt ≥ 2f + 1 + k at all times t. Remark: tolerance does not ensure confidentiality as a compromised node may leak information to the attacker. By appropriate use of cryptography and firewalls, it is possible to extend tolerance to provide confidentiality. Details omitted.
  • 29. 19/42 The Local Level: Intrusion Recovery Control (1/4) ▶ Nodes Nt ≜ {1, 2, . . . , Nt} and controllers π1,t, π2,t, . . . , πN,t. ▶ Hidden states SN = {H, C, ∅} (see figure). ▶ Actions: (W)ait and (R)ecover ▶ Observation oi,t ∼ Z represents the number of ids alerts related to node i at time t. ▶ A node controller computes bi,t ≜ P[Si,t = C | oi,1, . . .] and makes decisions ai,t = πi,t(bi,t) ∈ {W, R}. H C ∅ Crashed Healthy Compromised pC1 pC2 pR or a = R pA
  • 30. 20/42 The Local Level: Probability of Failure (2/4) 10 20 30 40 50 60 70 80 90 100 0.5 1 pA = 0.1 pA = 0.05 pA = 0.025 pA = 0.01 pA = 0.005 t P[St = C ∪ St = ∅] Probability of node compromise (St = C) or crash (St = ∅) in function of time t, assuming no recoveries.
  • 31. 21/42 The Local Level: Intrusion Recovery Control (3/4) Goals: minimize the average time-to-recovery T (R) i and minimize the frequency of recoveries F (R) i : minimize Ji ≜ lim T→∞ h ηT (R) i,T + F (R) i,T i (1) = lim T→∞ " 1 T T X t=1 ηsi,t − ai,tηsi,t + ai,t | {z } ≜cN(si,t ,ai,t ) # We define intrusion recovery to be the problem of minimizing the above objective subject to a bounded-time-to-recovery (btr) safety constraint which ensures that the time between two recoveries of a node is bounded to by ∆R, which can be configured by the system administrator.
  • 32. 22/42 The Local Level: Intrusion Recovery Control (4/4) Problem (Optimal Intrusion Recovery Control) minimize πi,t ∈ΠN Eπi,t [Ji | Bi,1 = pA] ∀i ∈ N (2a) subject to ai,k∆R = R ∀i, k (2b) si,t+1 ∼ fN(· | si,t, ai,t) ∀i, t (2c) oi,t+1 ∼ Z(· | si,t) ∀i, t (2d) ai,t+1 ∼ πi,t(bi,t) ∀i, t (2e) ai,t ∈ AN, si,t ∈ SN, oit, ∈ O ∀i, t (2f)
  • 33. 23/42 Numerical Results for the Intrusion Recovery Problem 0.0 0.5 1.0 bi,1 2 3 4 V ? i,1 (b i,1 ) pA = 0.1 0.0 0.5 1.0 bi,1 2 4 V ? i,1 (b i,1 ) pA = 0.05 0.0 0.5 1.0 bi,1 1 2 3 V ? i,1 (b i,1 ) pA = 0.025 0.0 0.5 1.0 bi,1 2 4 V ? i,1 (b i,1 ) pA = 0.01 0.0 0.5 1.0 bi,1 0 2 4 V ? i,1 (b i,1 ) pA = 0.005 Illustration of the optimal value function V ⋆ i,t(bi,t) for the local control problem. V ⋆ i,t was computed using the incremental pruning algorithm; black lines indicate the value function; red lines indicate the alpha-vectors.
  • 34. 23/42 Numerical Results for the Intrusion Recovery Problem 0.0 0.5 1.0 bi,1 2 3 4 V ? i,1 (b i,1 ) pA = 0.1 0.0 0.5 1.0 bi,1 2 4 V ? i,1 (b i,1 ) pA = 0.05 0.0 0.5 1.0 bi,1 1 2 3 V ? i,1 (b i,1 ) pA = 0.025 0.0 0.5 1.0 bi,1 2 4 V ? i,1 (b i,1 ) pA = 0.01 0.0 0.5 1.0 bi,1 0 2 4 V ? i,1 (b i,1 ) pA = 0.005 Illustration of the optimal value function V ⋆ i,t(bi,t) for the local control problem. V ⋆ i,t was computed using the incremental pruning algorithm; black lines indicate the value function; red lines indicate the alpha-vectors. Based on the above results we hypothesize that there exists an optimal recovery strategy of the form: 0 1 1 0 ai,t = π⋆ i,t(b) α⋆ t b (W)ait (R)ecover W R
  • 35. 24/42 Structure of an Optimal Intrusion Recovery Strategy (1/2) Theorem (Optimal Threshold Recovery Strategies) If the following holds pA, pU, pC1 , pC2 ∈ (0, 1) (A) pA + pU ≤ 1 (B) pC1 (pU − 1) pA(pC1 − 1) + pC1 (pU − 1) ≤ pC2 (C) Z(oi,t | si,t) > 0 ∀oi,t, si,t (D) Z is tp-2 (E) then there exists an optimal recovery strategy π⋆ i,t for each node i ∈ N that satisfies π⋆ i,t(bi,t) = R ⇐⇒ bi,t ≥ α⋆ t ∀t, α⋆ t ∈ [0, 1] (3)
  • 36. 25/42 Structure of an Optimal Intrusion Recovery Strategy (2/2) Corollary (Stationary Optimal Strategy as ∆R → ∞) The recovery thresholds satisfy α⋆ t+1 ≥ α⋆ t for all t ∈ [k∆R, (k + 1)∆R] and as ∆R → ∞, the thresholds converge to a time-independent threshold α⋆. 20 40 60 80 100 0.85 0.9 0.95 α⋆ t t The thresholds were computed using the Incremental Pruning algorithm.
  • 37. 26/42 The Global Level: Controlling the Replication Factor (1/6) . . . π1(b1) π2(b2) π3(b3) π4(b4) πN(bN) Belief transmissions Node controllers Replicated system System controller π(b1, . . . , bN) b1 b2 b3 b4 bN
  • 38. 27/42 The Global Level: Controlling the Replication Factor (2/6) 0 1 smax −1 smax . . . at = 1 at = 1 at = 0 at = 0 at = 0 at = 0 ▶ At each time t, the system controller receives the belief states b1,t, . . . , bN,t from the node controllers and decides if the replication factor N should be increased. ▶ State st: estimated number of healthy nodes based on b1,t, . . . , bN,t ▶ Actions: at ∈ {0, 1} ≜ AS, where at = 1 means that a new node is added to the system and at = 0 is a passive action. ▶ smax is the maximum number of nodes (needed to define the theoretical model). In practice smax may be very large.
  • 39. 28/42 The Global Level (3/6): Mean Time to Failure 10 20 30 40 50 60 70 80 90 100 200 400 600 pA = 0.1 pA = 0.05 pA = 0.025 pA = 0.01 pA = 0.005 N1 E[T(f )] Mean time to failure (mttf) in function of the initial number of nodes N1; T(f ) is a random variable representing the time when Nt < f + 1 with f = 3 and k = 1; the curves relate to different intrusion probabilities pA.
  • 40. 29/42 The Global Level (4/6): Reliability Curve 10 20 30 40 50 60 70 80 90 100 0.2 0.4 0.6 0.8 1 N1 = 25 N1 = 50 N1 = 100 N1 = 200 N1 = 400 t R(t) Reliability curves for varying number of nodes N; The reliability function is defined as R(t) ≜ P[T(f ) > t] where T(f ) is a random variable representing the time when Nt < f + 1 with f = 3.
  • 41. 30/42 The Global Level (5/6): Controlling the Replication Factor Goal: maximize the average service availability T(A) and minimize the number of nodes. We model these two goals with the following constrained objective minimize J ≜ lim T→∞ " 1 T T X t=1 st # (4) subject to T(A) ≥ ϵA =⇒ lim T→∞ " 1 T T X t=1 JNt ≥ 2f + 1K # ≥ ϵA =⇒ lim T→∞ " 1 T T X t=1 Jst ≥ f + 1K # ≥ ϵA where ϵA is the minimum allowed average service availability with respect to the tolerance threshold f .
  • 42. 31/42 The Global Level (6/6): Controlling the Replication Factor Problem (Optimal Control of the Replication Factor) minimize π∈ΠS Eπ [J | S1 = N] (5a) subject to Eπ h T(A) i ≥ ϵA ∀t (5b) st+1 = fS(st, at, δt) ∈ SS ∀t (5c) δt ∼ p∆(st) ∀t (5d) at+1 ∼ πt(st) ∈ AS ∀t (5e)
  • 43. 32/42 Structure of an Optimal Scaling Strategy Theorem If the following holds ∃π ∈ ΠS such that Eπ h T(A) i ≥ ϵA (A) fS(s′ | s, a) > 0 ∀s′ , s, a (B) smax X s′=s fS(s′ | ŝ + 1, a) ≥ smax X s′=s fS(s′ | ŝ, a) ∀s, ŝ, a (C) then there exists two strategies πλ1 and πλ1 that satisfy πλ1 (st) = 1 ⇐⇒ st ≤ β1 πλ2 (st) = 1 ⇐⇒ st ≤ β2 ∀t (6) and an optimal randomized threshold strategy π⋆ that satisfies π⋆ (st) = κπλ1 (st) + (1 − κ)πλ2 (st) ∀t (7) for some probability κ ∈ [0, 1].
  • 44. 33/42 Efficient Algorithms for Computing the Optimal Control Strategies ▶ The problem of computing an optimal scaling strategy has polynomial time-complexity. This follows because the problem can be formulated as a linear program of polynomial size. ▶ The problem of computing the optimal intrusion recovery strategies is in the complexity class pspace-hard, and thus no efficient (polynomial-time) algorithm for solving this problem is known. (Note that p ⊆ np ⊆ pspace.) ▶ To manage the high computational complexity of computing the optimal recovery strategies we leverage Theorem 1 and Corollary 1.
  • 45. 33/42 Algorithm for The Local Control Problem Algorithm 1: Recovery Threshold Optimization (rto) 1 Input: η, pA, pC1 , pC2 , pU, Z, ∆R 2 Parametric optimization algorithm: PO 3 Output: A near-optimal local control strategy π̂θ,t 4 Algorithm 5 d ← 1 − ∆R if ∆R < ∞ else d ← 1 6 Θ ← [0, 1]d 7 For each θ ∈ Θ, define πi,θ(bt) as πi,θ(bt) ≜ ( R if bt ≥ θi where i = max[t, d] W otherwise 8 Jθ ← Eπi,θ [Ji ] 9 π̂θ,t ← PO(Θ, Jθ) 10 return π̂θ,t
  • 46. 33/42 Algorithm for The Global Control Problem Algorithm 2: Linear Program for Scaling (lp-r) 1 Input: smax, ϵA, N, f , p∆ 2 Linear programming solver: LPSolver 3 Output: An optimal global control strategy π⋆ 4 Algorithm 5 Solve the following linear program with LPSolver minimize ρ X s∈SS X a∈AS sρ(s, a) (8a) subject to (8b) ρ(s, a) ≥ 0 ∀s ∈ SS, a ∈ AS (8c) X s∈SS X a∈AS ρ(s, a) = 1 (8d) X a∈AS ρ(s, a) = X s′∈SS X a∈AS ρ(s′ , a)fS(s′ |s, a) ∀s ∈ SS (8e) X s∈SS X a∈AS ρ(s, a)Jst ≥ f + 1K ≥ ϵA (8f) 6 Let ρ⋆ denote the solution to the above program and define π⋆ as π⋆ (a|s) ≜ ρ⋆(s, a) P s∈SS ρ⋆(s, a) ∀s ∈ SS, a ∈ AS return π⋆
  • 47. 34/42 Evaluation of the rto Algorithm (1/2) 0.0 0.2 0.09 0.10 0.11 Avg cost J i ∆R = 3 0.0 2.5 0.12 0.14 ∆R = 4 0 10 0.12 0.15 0.18 ∆R = 5 0 5 0.15 0.20 ∆R = 6 0 10 0.15 0.20 0.25 ∆R = 7 0 10 0.20 0.25 ∆R = 8 0 10 0.20 0.25 ∆R = 9 0 10 0.20 0.25 ∆R = 10 0 10 0.20 0.25 Avg cost J i ∆R = 11 0 10 0.20 0.25 ∆R = 12 0 10 0.20 0.25 ∆R = 13 0 20 0.20 0.25 ∆R = 14 0 20 0.20 0.25 ∆R = 15 0 20 0.20 0.25 ∆R = 16 0 10 0.20 0.25 ∆R = 17 0 10 0.20 0.25 ∆R = 18 0 10 Time (min) 0.20 0.25 Avg cost J i ∆R = 19 0 20 Time (min) 0.20 0.25 ∆R = 20 0 20 Time (min) 0.20 0.25 0.30 ∆R = 21 0 20 Time (min) 0.20 0.25 0.30 ∆R = 22 0 20 Time (min) 0.20 0.25 0.30 ∆R = 23 0 20 Time (min) 0.20 0.25 0.30 ∆R = 24 0 20 Time (min) 0.20 0.25 0.30 ∆R = 25 0 20 Time (min) 0.20 0.40 ∆R = ∞ optimal cem de bo spsa ppo lower bound mean values from evaluations with 20 different random seeds; ± indicate the 95% confidence interval based on the Student’s t-distribution. Method ∆R = 5 ∆R = 15 ∆R = 25 ∆R = ∞ Time (min) Ji Time (min) Ji Time (min) Ji Time (min) Ji cem 1.04 0.12 ± 0.01 8.84 0.17 ± 0.06 14.48 0.19 ± 0.08 11.81 0.16 ± 0.01 de 2.35 0.12 ± 0.03 8.98 0.17 ± 0.01 15.45 0.18 ± 0.02 22.68 0.16 ± 0.01 spsa 10.78 0.18 ± 0.01 88.35 0.58 ± 0.40 123.85 0.77 ± 0.48 4.20 0.20 ± 0.02 bo 29.18 0.12 ± 0.02 62.57 0.17 ± 0.05 90.26 0.18 ± 0.12 9.07 0.15 ± 0.06 ppo 28.20 0.18 ± 0.01 30.01 0.19 ± 0.02 30.33 0.21 ± 0.07 28.95 0.21 + ±0.09 ip 11.11 0.12 237.06 0.17 743.73 0.18 > 10000 not converged
  • 48. 35/42 Evaluation of the rto Algorithm (2/2) 5 10 15 20 25 30 500 1,000 cem de bo spsa ip ∆R Convergence time (min) Time required to compute optimal intrusion recovery strategies; the x-axis indicate different values of ∆R; the error bars indicate the 95% confidence interval based on the Student’s t-distribution with 20 measurements.
  • 49. 36/42 Evaluation of the lp-r Algorithm 2 4 8 16 32 64 128 256 512 1024 2048 4096 0 100 200 Time to solve (s) smax Time required to compute optimal scaling strategies; the x-axis indicate different values of smax; the error bars indicate the 95% confidence interval based on the Student’s t-distribution with 20 measurements.
  • 50. 37/42 Outline ▶ Use Case & Research Problem ▶ Use case: intrusion tolerance ▶ Goal: optimal control strategies for intrusion-tolerant systems ▶ Background ▶ Fault tolerance and intrusion tolerance ▶ State machine replication ▶ Our Contributions ▶ The tolerance control architecture ▶ Constrained two-level control problem ▶ Theoretical results ▶ Computational algorithms ▶ Comparison with State-of-the-art ▶ Implementation and evaluation ▶ Conclusions
  • 51. 38/42 We Instantiate The tolerance Control Architecture with The Computed Control Strategies tolerance Node 1 Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware Node 2 Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware Node N Privileged domain Application domain Service replica Node controller ids alerts reco- very Virtualization layer Hardware . . . Intrusion-tolerant consensus protocol Client interface System controller State estimate Scaling action State estimate Scaling action State estimate Scaling action . . . Service requests Responses Clients Attacker Intrusion attempts
  • 52. 39/42 Experiment Setup - Physical Servers Server Processors ram (gb) 1, r715 2u 2 12-core amd opteron 64 2, r715 2u 2 12-core amd opteron 64 3, r715 2u 2 12-core amd opteron 64 4, r715 2u 2 12-core amd opteron 64 5, r715 2u 2 12-core amd opteron 64 6, r715 2u 2 12-core amd opteron 64 7, r715 2u 2 12-core amd opteron 64 8, r715 2u 2 12-core amd opteron 64 9, r715 2u 2 12-core amd opteron 64 10, r630 2u 2 12-core intel xeon e5-2680 256 11, r740 2u 1 20-core intel xeon gold5218r 32 12, supermicro 7049 2 tesla p100, 1 16-core intel xeon 126 13, supermicro 7049 4 rtx 8000, 1 24-core intel xeon 768 Table 1: Specifications of the physical servers.
  • 53. 39/42 Experiment Setup - Replica Configurations Replica ID Operating system Vulnerabilities 1 ubuntu 14 ftp weak password 2 ubuntu 20 ssh weak password 3 ubuntu 20 telnet weak password 4 debian 10.2 cve-2017-7494 5 ubuntu 20 cve-2014-6271 6 debian 10.2 cve-89 on cve 7 debian 10.2 cve-2015-3306 8 debian 10.2 cve-2016-10033 9 debian 10.2 cve-2010-0426, ssh weak password 10 debian 10.2 cve-2015-5602, ssh weak password Table 2: Replica configurations.
  • 54. 39/42 Experiment Setup - Emulated Intrusions Replica ID Intrusion steps 1 tcp syn scan, ftp brute force 2 tcp syn scan, ssh brute force 3 tcp syn scan, telnet brute force 4 icmp scan, exploit of cve-2017-7494 5 icmp scan, exploit of cve-2014-6271 6 icmp scan, exploit of cve-89 on on cve 7 icmp scan, exploit of cve-2015-3306 8 icmp scan, exploit of cve-2016-10033 9 icmp scan, ssh brute force, exploit of cve-2010-0426 10 icmp scan, ssh brute force, exploit of cve-2015-5602 Table 3: Intrusion steps
  • 55. 39/42 Experiment Setup - Background Traffic Background services Replica ID(s) ftp, ssh, mongodb, http, teamspeak 1 ssh, dns, http 2 ssh, telnet, http 3 ssh, samba, ntp 4 ssh 5, 7, 8, 10 cve, irc, ssh 6 teamspeak, http, ssh 9 Table 4: Background services; each background client invokes functions on service replicas.
  • 56. 39/42 Experiment Setup - Consensus Algorithm We implement and extend the minbft Byzantine fault-tolerant consensus algorithm to be reconfigurable. Client Replica 1 (leader) Replica 2 Replica 3 request prepare commit reply Replica 1 (leader v) (leader v + 1) Replica 2 Replica 3 crash request view-change view-change new-view Replica 1 Replica 2 Replica 3 checkpoint Controller Replica 1 (compromised) Replica 2 Replica 3 recover request state state New replica Replica 1 (leader v) (leader v + 1) Replica 2 Replica 3 join-request join new-view join-reply System controller Replica 1 (leader v) (leader v + 1) Replica 2 Replica 3 evict-request evict new-view exit-reply a) Normal operation b) View change c) Checkpoint d) State transfer e) Join f) Evict
  • 57. 39/42 Experiment Setup - Consensus Algorithm Throughput of our implementation of minbft. 3 4 5 6 7 8 9 10 20 40 60 1 client 20 clients Number of nodes (N) Avg throughput (req/s)
  • 58. 40/42 System Identification 0 2000 4000 6000 8000 probability cve-2010-0426 0 2000 4000 6000 8000 cve-2015-3306 0 2000 4000 6000 8000 cve-2015-5602 0 2000 4000 6000 8000 cve-2016-10033 0 2000 4000 6000 8000 cwe-89 0 2000 4000 6000 8000 O probability cve-2017-7494 0 2000 4000 6000 8000 O cve-2014-6271 0 5000 10000 15000 20000 O ftp brute force 0 5000 10000 15000 20000 O ssh brute force 0 5000 10000 15000 20000 O telnet brute force intrusion no intrusion intrusion model normal operation model Empirical observation distributions b Z1(· | s), . . . , b Z10(· | s) as estimates of Z1, . . . , Z10. ▶ Empirical distributions based on M = 25, 000 samples. ▶ From the Glivenko-Cantelli theorem we know that b Z →a.s Z as M → ∞. ▶ Bound: P h Dkl(b Z(· | s) ∥ Z(· | s)) ≥ ϵ i ≤ 2−M ϵ−|O| ln(M+1) M = 2 −25·103 ϵ−2·103 ln(25·103+1) 25·103 = 2−5·103 (5ϵ−4 ln(25·103+1))
  • 59. 41/42 Comparison with State-of-the-art Intrusion-Tolerant Systems 0 0.5 1 5 15 25 ∞ 101 102 5 15 25 ∞ 0 0.1 0.2 5 15 25 ∞ 0 10 20 30 40 5 15 25 ∞ 0 0.5 1 5 15 25 ∞ 101 102 5 15 25 ∞ 0 0.1 0.2 5 15 25 ∞ 0 20 40 60 5 15 25 ∞ 0 0.5 1 5 15 25 ∞ 101 102 no-recovery periodic tolerance periodic-adaptive 5 15 25 ∞ 0 0.1 0.2 5 15 25 ∞ 0 50 100 5 15 25 ∞ Maximum time-to-recovery ∆R Maximum time-to-recovery ∆R Maximum time-to-recovery ∆R Maximum time-to-recovery ∆R Average availability T(A) Average time-to-recovery T(R) Average recovery frequency F(R) Average cost Ji + J N 1 = 3 N 1 = 6 N 1 = 9 Comparison between tolerance and the baselines; the columns represent: average availability (T(A) ), average time-to-recovery (T(R) ); average recovery frequency (F(R) ); and average cost Ji + J; the bars indicate the mean value from evaluations with 20 different random seeds; the error bars indicate the 95% confidence interval based on the Student’s t-distribution.
  • 60. 42/42 Conclusions ▶ We present tolerance: a novel control architecture for intrusion-tolerant systems which improves state-of-the-art. ▶ We prove that the optimal control strategies have threshold structures and design efficient algorithms for computing them. ▶ We evaluate tolerance in an emulation environment against 10 different types of network intrusions. . . . π1(b1) π2(b2) π3(b3) π4(b4) πN(bN) Belief transmissions Node controllers Replicated system System controller π(b1, . . . , bN) b1 b2 b3 b4 bN