1/27
Learning Security Strategies
through Game Play and Optimal Stopping
ICML 22’, Baltimore, 17/7-23/7 2022
Machine Learning for Cyber Security Workshop
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
Mar 18, 2022
2/27
Use Case: Intrusion Prevention
I A Defender owns an infrastructure
I Consists of connected components
I Components run network services
I Defender defends the infrastructure
by monitoring and active defense
I Has partial observability
I An Attacker seeks to intrude on the
infrastructure
I Has a partial view of the
infrastructure
I Wants to compromise specific
components
I Attacks by reconnaissance,
exploitation and pivoting
[Figure: the IT infrastructure topology: an attacker and clients connect through a gateway; an IPS sends alerts to the defender; the internal network consists of components numbered 1-31.]
3/27
The Intrusion Prevention Problem
[Figure: example traces over time t (1-200) of the defender's belief b, the number of IPS alerts, and the number of logins.]
When to take a defensive action?
Which action to take?
4/27
A Brief History of Intrusion Prevention
[Timeline figure. Reference points: ARPANET (1969); Internet (1980s). Intrusion prevention milestones: manual detection/prevention (1980s); audit logs and manual detection/prevention (1980s); rule-based IDS/IPS (1990s); rule-based & statistical IDS/IPS (2000s); research on ML-based IDS, RL-based IPS, control-based IPS, and computational game theory for IPS (2010-present).]
5/27
Our Approach for Finding Effective Security Strategies
[Figure: overview of our approach. The target infrastructure is selectively replicated in an emulation system, which provides strategy evaluation & model estimation. Model creation & system identification produce a simulation system, where strategies are obtained through reinforcement learning & generalization. Strategy mapping π carries learned strategies back to the emulation system for strategy implementation, supporting automation & self-learning systems.]
6/27
Outline
I Use Case & Approach:
I Use case: Intrusion prevention
I Approach: Emulation, simulation, and reinforcement learning
I Game-Theoretic Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Existence of Nash Equilibria
I Structural result: multi-threshold best responses
I Our Method for Learning Equilibrium Strategies
I Our method for emulating the target infrastructure
I Our system identification algorithm
I Our reinforcement learning algorithm: T-FP
I Results & Conclusion
I Numerical evaluation results, conclusion, and future work
7/27
The Optimal Stopping Game
I Defender:
I Has a pre-defined ordered list of
defensive measures:
1. Revoke user certificates
2. Blacklist IPs
3. Drop traffic that generates IPS alerts of priority 1-4
4. Block gateway
I Defender’s strategy decides when to
take each action
I Attacker:
I Has a pre-defined randomized
intrusion sequence of reconnaissance
and exploit commands:
1. TCP-SYN scan
2. CVE-2017-7494
3. CVE-2015-3306
4. CVE-2015-5602
5. SSH brute-force
6. . . .
I Attacker's strategy decides when to
start/stop an intrusion
[Figure: the IT infrastructure topology: an attacker and clients connect through a gateway; an IPS sends alerts to the defender; the internal network consists of components numbered 1-31.]
We analyze attacker/defender strategies using optimal stopping theory
8/27
Optimal Stopping Formulation of Intrusion Prevention
[Figure: timeline of a game episode from t = 1 to t = T: the intrusion starts at the attacker's stopping time τ2,1 and is prevented after the defender's stopping times τ1,1, τ1,2, τ1,3.]
I The attacker’s stopping times τ2,1, τ2,2, . . . determine the
times to start/stop the intrusion
I During the intrusion, the attacker follows a fixed intrusion
strategy
I The defender’s stopping times τ1,1, τ1,2, . . . determine the
times to update the IPS configuration
We model this game as a zero-sum partially observed stochastic game
9/27
Partially Observed Stochastic Game
I Players: N = {1, 2} (Defender=1)
I States: Intrusion st ∈ {0, 1}, terminal ∅.
I Observations:
I IPS Alerts ∆x1,t, ∆x2,t, . . . , ∆xM,t and
defender stops remaining lt ∈ {1, . . . , L};
the alerts are distributed as fX(∆x1, . . . , ∆xM|st)
I Actions:
I A1 = A2 = {S, C}
I Rewards:
I Defender reward: security and service.
I Attacker reward: negative of defender
reward.
I Transition probabilities:
I Follows from game dynamics.
I Horizon:
I ∞
[Figure: state transition diagram over the states 0 (no intrusion, t ≥ 1, lt > 0), 1 (intrusion, t ≥ 2, lt > 0), and the terminal state ∅. The game moves from 0 to 1 when the attacker stops (a(2)t = S). It moves to ∅ when lt = 1 and the defender stops (a(1)t = S), when the intrusion is prevented w.p. φlt, or when the attacker stops (a(2)t = S).]
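The transition diagram above can be summarized in code. A minimal sketch of one game transition, assuming illustrative prevention probabilities φl (here phi) and a hypothetical interface; it is a sketch of the dynamics described on this slide, not the paper's implementation.

import random

TERMINAL = None   # the terminal state ∅

def step(s, l, a1, a2, phi):
    """One transition of the stopping game (sketch).
    s in {0, 1} (intrusion state), l = defender stops remaining,
    a1, a2 in {'S', 'C'}, phi[l] = assumed probability that a defensive
    stop prevents an ongoing intrusion when l stops remain."""
    if a1 == 'S':
        if l == 1:                              # last defensive stop: game ends
            return TERMINAL, 0
        if s == 1 and random.random() < phi[l]:
            return TERMINAL, l - 1              # intrusion prevented w.p. phi_l
        l -= 1                                  # otherwise one stop is consumed
    if a2 == 'S':
        if s == 0:
            return 1, l                         # attacker starts the intrusion
        return TERMINAL, l                      # attacker stops the intrusion
    return s, l                                 # both players continue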
10/27
One-Sided Partial Observability
I We assume that the attacker has perfect information. Only
the defender has partial information.
I The attacker's view: the state sequence s1, s2, s3, . . . , sn is observed
I The defender's view: only the observation sequence o1, o2, o3, . . . , on is observed
I Makes it tractable to compute the defender's belief
b(1)t (st) = P[st|ht] (avoids nested beliefs)
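A minimal sketch of how the defender's belief b(1)t(st) = P[st|ht] can be updated recursively; the observation likelihood f_o and the per-step intrusion-start probability p_start are assumptions for illustration, not the paper's model parameters.

def belief_update(b, o, f_o, p_start):
    """One-step Bayes filter for b = P[s_t = 1 | h_t].

    b       : current probability that an intrusion is ongoing
    o       : latest observation (e.g., a weighted IPS alert count)
    f_o     : f_o[s](o) = assumed observation likelihood given state s
    p_start : assumed probability that the attacker starts an intrusion this step
    """
    # Predicted probability of an ongoing intrusion before seeing o
    pred1 = b + (1.0 - b) * p_start
    pred0 = 1.0 - pred1
    # Bayes update with the observation likelihoods
    num = f_o[1](o) * pred1
    den = num + f_o[0](o) * pred0
    return num / den if den > 0 else b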
11/27
Outline
I Use Case & Approach:
I Use case: Intrusion prevention
I Approach: Emulation, simulation, and reinforcement learning
I Game-Theoretic Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Existence of Nash Equilibria
I Structural result: multi-threshold best responses
I Our Method for Learning Equilibrium Strategies
I Our method for emulating the target infrastructure
I Our system identification algorithm
I Our reinforcement learning algorithm: T-FP
I Results & Conclusion
I Numerical evaluation results, conclusion, and future work
12/27
Game Analysis
I Defender strategy is of the form: π1,l : B → ∆(A1)
I Attacker strategy is of the form: π2,l : S × B → ∆(A2)
I Defender and attacker objectives:
J1(π1,l, π2,l) = E(π1,l,π2,l)[ Σ_{t=1}^{∞} γ^(t−1) Rl(st, at) ],   J2(π1,l, π2,l) = −J1(π1,l, π2,l)
I Best response correspondences:
B1(π2,l) = arg max_{π1,l ∈ Π1} J1(π1,l, π2,l),   B2(π1,l) = arg max_{π2,l ∈ Π2} J2(π1,l, π2,l)
I Nash equilibrium (π∗1,l, π∗2,l):
π∗1,l ∈ B1(π∗2,l) and π∗2,l ∈ B2(π∗1,l)
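The objective J1 above can be estimated by Monte-Carlo rollouts in the simulation system. A minimal sketch, assuming hypothetical env, pi1, and pi2 interfaces; it is not the paper's implementation.

def estimate_J1(env, pi1, pi2, gamma=0.99, episodes=100):
    """Monte-Carlo estimate of J1(π1,l, π2,l) = E[ Σ_t γ^(t−1) Rl(st, at) ].
    env, pi1, pi2 are assumed interfaces to the simulated stopping game;
    env.step returns the defender's reward Rl(st, at)."""
    total = 0.0
    for _ in range(episodes):
        s, b, l = env.reset()          # state, defender belief, stops remaining
        discount, G, done = 1.0, 0.0, False
        while not done:
            a1 = pi1(b, l)             # defender acts on its belief
            a2 = pi2(s, b, l)          # attacker also observes the state
            (s, b, l), r, done = env.step(a1, a2)
            G += discount * r          # accumulate discounted defender reward
            discount *= gamma
        total += G
    return total / episodes            # estimate of J1; J2 = -J1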
13/27
Game Analysis
Theorem
Given the one-sided POSG Γ with L ≥ 1, the following holds.
(A) Γ has a mixed Nash equilibrium. Further, Γ has a pure Nash
equilibrium when s = 0 ⇐⇒ b(1) = 0.
(B) Given any attacker strategy π2,l ∈ Π2, if fO|s is totally positive
of order 2, there exist values α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1] and
a defender best response strategy π̃1,l ∈ B1(π2,l ) that satisfies:
π̃1,l(b(1)) = S ⇐⇒ b(1) ≥ α̃l,   l ∈ {1, . . . , L} (1)
(C) Given a defender strategy π1,l ∈ Π1, where π1,l (S|b(1)) is
non-decreasing in b(1) and π1,l (S|1) = 1, there exist values
β̃0,1, β̃1,1, . . ., β̃0,L, β̃1,L ∈ [0, 1] and a best response strategy
π̃2,l ∈ B2(π1,l ) of the attacker that satisfies:
π̃2,l(0, b(1)) = C ⇐⇒ π1,l(S|b(1)) ≥ β̃0,l (2)
π̃2,l(1, b(1)) = S ⇐⇒ π1,l(S|b(1)) ≥ β̃1,l (3)
14/27
Structure of Best Response Strategies
[Figure: the belief space b(1) ∈ [0, 1] partitioned by the best responses. Defender: stopping sets S(1)1,π2,l, S(1)2,π2,l, . . . , S(1)L,π2,l given by the thresholds α̃1 ≥ α̃2 ≥ . . . ≥ α̃L. Attacker: stopping sets S(2)0,1,π1,l, . . . , S(2)0,L,π1,l and S(2)1,1,π1,l, . . . , S(2)1,L,π1,l given by the thresholds β̃0,1, . . . , β̃0,L and β̃1,1, . . . , β̃1,L.]
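A minimal sketch of what the multi-threshold structure of Theorem 1 looks like in code; the threshold values are illustrative placeholders, not the learned equilibrium thresholds from the paper.

# Sketch of the threshold best responses of Theorem 1 (illustrative values).
alpha = [0.9, 0.75, 0.6, 0.5, 0.4, 0.3, 0.2]   # defender thresholds, alpha_1 >= ... >= alpha_L

def defender_br(b, l):
    """Theorem 1(B): pi_1,l(b) = S  iff  b >= alpha_l (l counted from 1)."""
    return 'S' if b >= alpha[l - 1] else 'C'

beta0 = [0.5] * 7   # attacker thresholds on pi_1,l(S|b) when s = 0 (placeholders)
beta1 = [0.5] * 7   # attacker thresholds on pi_1,l(S|b) when s = 1 (placeholders)

def attacker_br(s, pi1_stop_prob, l):
    """Theorem 1(C): compare the defender's stopping probability
    pi_1,l(S|b) against the state-dependent attacker threshold."""
    if s == 0:   # no ongoing intrusion: continue waiting iff the defender is likely to stop
        return 'C' if pi1_stop_prob >= beta0[l - 1] else 'S'
    else:        # ongoing intrusion: stop (abort) iff the defender is likely to stop
        return 'S' if pi1_stop_prob >= beta1[l - 1] else 'C'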
15/27
Outline
I Use Case & Approach:
I Use case: Intrusion prevention
I Approach: Emulation, simulation, and reinforcement learning
I Game-Theoretic Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Existence of Nash Equilibria
I Structural result: multi-threshold best responses
I Our Method for Learning Equilibrium Strategies
I Our method for emulating the target infrastructure
I Our system identification algorithm
I Our reinforcement learning algorithm: T-FP
I Results & Conclusion
I Numerical evaluation results, conclusion, and future work
16/27
Our Method for Learning Effective Security Strategies
[Figure: overview of the method: selective replication of the target infrastructure into an emulation system (strategy evaluation & model estimation); model creation & system identification yield a simulation system (reinforcement learning & generalization); strategy mapping π and strategy implementation π in the emulation system; automation & self-learning systems.]
17/27
Emulating the Target Infrastructure
I Emulate hosts with docker containers
I Emulate IPS and vulnerabilities with
software
I Network isolation and traffic shaping
through NetEm in the Linux kernel
I Enforce resource constraints using
cgroups.
I Emulate client arrivals with a Poisson
process (see the sketch below)
I Internal connections are full-duplex
& lossless with bit capacities of 1000
Mbit/s
I External connections are full-duplex
with bit capacities of 100 Mbit/s,
0.1% packet loss in normal operation,
and random bursts of 1% packet loss
[Figure: the IT infrastructure topology: an attacker and clients connect through a gateway; an IPS sends alerts to the defender; the internal network consists of components numbered 1-31.]
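A minimal sketch of how Poisson client arrivals can be emulated by sampling exponential inter-arrival times; the arrival rate and horizon are illustrative parameters, not the emulation's actual configuration.

import random

def client_arrivals(rate_per_min, horizon_min, seed=None):
    """Sample client arrival times from a Poisson process with the given
    rate (arrivals per minute) over a horizon (minutes); both are assumed
    parameters used only for illustration."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_min)   # exponential inter-arrival time
        if t > horizon_min:
            return arrivals
        arrivals.append(t)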
18/27
System Identification: Instantiating the Game Model based
on Data from the Emulation
[Figure: probability distribution of the number of IPS alerts weighted by priority, ot (range 0-9000); the empirical distributions for st = 0 and st = 1 and the fitted models f̂O(ot|0) and f̂O(ot|1).]
I We fit a Gaussian mixture distribution f̂O as an estimate of fO
in the target infrastructure
I For each state s, we obtain the conditional distribution f̂O|s
through expectation-maximization (see the sketch below)
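A minimal sketch of this identification step, assuming alert counts collected from the emulation for each state and using scikit-learn's GaussianMixture (which runs expectation-maximization internally); the number of mixture components and the array names are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_observation_model(alerts_no_intrusion, alerts_intrusion, n_components=2):
    """Fit f̂_O(.|s) for s in {0, 1} from emulated alert counts (1-D arrays).
    The component count is an assumption; EM is run by GaussianMixture.fit."""
    models = {}
    for s, data in ((0, alerts_no_intrusion), (1, alerts_intrusion)):
        gm = GaussianMixture(n_components=n_components, random_state=0)
        gm.fit(np.asarray(data, dtype=float).reshape(-1, 1))
        models[s] = gm
    return models

# Usage sketch: log-likelihood of an observation o under each state
# models = fit_observation_model(x0, x1)
# log_lik_s0 = models[0].score_samples(np.array([[o]]))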
19/27
Our Reinforcement Learning Approach
I We learn a Nash equilibrium (π∗1,l,θ(1), π∗2,l,θ(2)) through
fictitious self-play.
I In each iteration (see the sketch below):
1. Learn a best response strategy of the defender by solving a
POMDP: π̃1,l,θ(1) ∈ B1(π2,l,θ(2))
2. Learn a best response strategy of the attacker by solving an
MDP: π̃2,l,θ(2) ∈ B2(π1,l,θ(1))
3. Store the best response strategies in two buffers Θ1, Θ2
4. Update strategies to be the average of the stored strategies
[Figure: the self-play process: the initial strategies π01, π02 are iteratively updated (π1, π2, . . .) until they converge to the equilibrium strategies π∗1, π∗2.]
(Pseudo-code is available in the paper)
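The paper contains the exact pseudo-code; the following is only a simplified, hedged sketch of the self-play loop, where the best-response solvers and the strategy-averaging routine are assumed callables.

def fictitious_self_play(best_response_defender, best_response_attacker,
                         average_strategy, pi1, pi2, iterations=100):
    """Simplified sketch of the self-play loop (not the paper's pseudo-code).
    The three callables are assumptions: a POMDP solver for the defender, an
    MDP solver for the attacker, and a routine that averages the strategies
    stored in the buffers Θ1, Θ2."""
    theta1_buf, theta2_buf = [], []
    for _ in range(iterations):
        theta1_buf.append(best_response_defender(pi2))  # defender vs current attacker
        theta2_buf.append(best_response_attacker(pi1))  # attacker vs current defender
        pi1 = average_strategy(theta1_buf)              # average-strategy update
        pi2 = average_strategy(theta2_buf)
    return pi1, pi2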
20/27
Our Reinforcement Learning Algorithm for Learning
Best-Response Threshold Strategies
I We use the structural result that threshold best response
strategies exist (Theorem 1) to design an efficient
reinforcement learning algorithm to learn best response
strategies.
I We seek to learn:
I L thresholds of the defender, α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1]
I 2L thresholds of the attacker, β̃0,1, β̃1,1, . . . , β̃0,L, β̃1,L ∈ [0, 1]
I We learn these thresholds iteratively through Robbins and
Monro’s stochastic approximation algorithm.1
[Figure: the learning loop: Monte-Carlo simulation produces a performance estimate Ĵ(θn); stochastic approximation (RL) uses it to update θn to θn+1, starting from θ1 and converging to θ̂∗.]
1. Herbert Robbins and Sutton Monro. "A Stochastic Approximation Method". In: The Annals of Mathematical
Statistics 22.3 (1951), pp. 400-407. doi: 10.1214/aoms/1177729586.
21/27
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ(i)n to obtain
θ(i)n + cn∆n and θ(i)n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi(θ(i)n + cn∆n)
and Ĵi(θ(i)n − cn∆n) to estimate ∇θ(i) Ji(θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:
(∇̂θ(i)n Ji(θ(i)n))k = (Ĵi(θ(i)n + cn∆n) − Ĵi(θ(i)n − cn∆n)) / (2cn(∆n)k)
5. Next, we use the estimated gradient to update the vector of
thresholds through the stochastic approximation update (see the
sketch below):
θ(i)n+1 = θ(i)n + an ∇̂θ(i)n Ji(θ(i)n)
4. James C. Spall. "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation". In: IEEE Transactions on Automatic Control 37.3 (1992), pp. 332-341.
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ
(i)
n to obtain
θ
(i)
n + cn∆n and θ
(i)
n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi (θ
(i)
n + cn∆n)
and Ĵi (θ
(i)
n − cn∆n) to estimate ∇θ(i) Ji (θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:

ˆ
∇θ
(i)
n
Ji (θ(i)
n )

k
=
Ĵi (θ
(i)
n + cn∆n) − Ĵi (θ
(i)
n − cn∆n)
2cn(∆n)k
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ
(i)
n+1 = θ(i)
n + an
ˆ
∇θ
(i)
n
Ji (θ(i)
n )
4
James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation”. In: IEEE TRANSACTIONS ON AUTOMATIC CONTROL 37.3 (1992), pp. 332–341.
21/27
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ
(i)
n to obtain
θ
(i)
n + cn∆n and θ
(i)
n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi (θ
(i)
n + cn∆n)
and Ĵi (θ
(i)
n − cn∆n) to estimate ∇θ(i) Ji (θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:

ˆ
∇θ
(i)
n
Ji (θ(i)
n )

k
=
Ĵi (θ
(i)
n + cn∆n) − Ĵi (θ
(i)
n − cn∆n)
2cn(∆n)k
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ
(i)
n+1 = θ(i)
n + an
ˆ
∇θ
(i)
n
Ji (θ(i)
n )
4
James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation”. In: IEEE TRANSACTIONS ON AUTOMATIC CONTROL 37.3 (1992), pp. 332–341.
21/27
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ
(i)
n to obtain
θ
(i)
n + cn∆n and θ
(i)
n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi (θ
(i)
n + cn∆n)
and Ĵi (θ
(i)
n − cn∆n) to estimate ∇θ(i) Ji (θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:

ˆ
∇θ
(i)
n
Ji (θ(i)
n )

k
=
Ĵi (θ
(i)
n + cn∆n) − Ĵi (θ
(i)
n − cn∆n)
2cn(∆n)k
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ
(i)
n+1 = θ(i)
n + an
ˆ
∇θ
(i)
n
Ji (θ(i)
n )
4
James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation”. In: IEEE TRANSACTIONS ON AUTOMATIC CONTROL 37.3 (1992), pp. 332–341.
21/27
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ
(i)
n to obtain
θ
(i)
n + cn∆n and θ
(i)
n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi (θ
(i)
n + cn∆n)
and Ĵi (θ
(i)
n − cn∆n) to estimate ∇θ(i) Ji (θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:

ˆ
∇θ
(i)
n
Ji (θ(i)
n )

k
=
Ĵi (θ
(i)
n + cn∆n) − Ĵi (θ
(i)
n − cn∆n)
2cn(∆n)k
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ
(i)
n+1 = θ(i)
n + an
ˆ
∇θ
(i)
n
Ji (θ(i)
n )
4
James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation”. In: IEEE TRANSACTIONS ON AUTOMATIC CONTROL 37.3 (1992), pp. 332–341.
22/27
Outline
I Use Case & Approach:
I Use case: Intrusion prevention
I Approach: Emulation, simulation, and reinforcement learning
I Game-Theoretic Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Existence of Nash Equilibria
I Structural result: multi-threshold best responses
I Our Method for Learning Equilibrium Strategies
I Our method for emulating the target infrastructure
I Our system identification algorithm
I Our reinforcement learning algorithm: T-FP
I Results & Conclusion
I Numerical evaluation results, conclusion, and future work
23/27
Evaluation Results: Learning Nash Equilibrium Strategies
[Figure: exploitability, defender reward per episode, and intrusion length versus the number of training iterations; legend: (π1,l, π2,l) emulation, (π1,l, π2,l) simulation, Snort IPS, ot ≥ 1, upper bound.]
Learning curves from the self-play process with T-FP; the red curves
show simulation results and the blue curves show emulation results; the
purple, orange, and black curves relate to baseline strategies; the curves
indicate the mean and the 95% confidence interval over four training
runs with different random seeds.
24/27
Evaluation Results: Convergence Rates
[Figure: exploitability (left) and approximation error/gap (right) versus running time (min) for T-FP, NFSP, and HSVI.]
Comparison between T-FP and two baseline algorithms: NFSP and
HSVI; the red curve relates to T-FP, the blue curve to NFSP, and the purple
curve to HSVI; the left plot shows the approximate exploitability metric
and the right plot shows the HSVI approximation error5.
5. Karel Horák, Branislav Bošanský, and Michal Pěchouček. "Heuristic Search Value Iteration for One-Sided
Partially Observable Stochastic Games". In: Proceedings of the AAAI Conference on Artificial Intelligence (2017).
url: https://ojs.aaai.org/index.php/AAAI/article/view/10597.
25/27
Evaluation Results: Inspection of Learned Strategies
[Figure: π2,l(S|0, π1(S|b(1))) (left), π2,l(S|1, π1(S|b(1))) (middle), and π1,l(S|b(1)) (right) versus b(1) ∈ B, for l = 7, l = 4, and l = 1.]
Probability of the stop action S under the learned equilibrium strategies as a
function of b(1) and l; the left and middle plots show the attacker's
stopping probability when s = 0 and s = 1, respectively; the right plot
shows the defender's stopping probability.
26/27
Evaluation Results: Inspection of Learned Game Values
[Figure: V̂∗7(b(1)) and V̂∗1(b(1)) plotted over b(1) ∈ [0, 1].]
The value function V̂∗l(b(1)) computed through the HSVI algorithm
with approximation error 4; the blue and red curves relate to l = 7 and
l = 1, respectively.
27/27
Conclusions & Future Work
I Conclusions:
I We develop a method to automatically learn security strategies:
I (1) emulation system; (2) system identification; (3) simulation system; and (4)
reinforcement learning
I We apply the method to an intrusion prevention use case:
I We formulate intrusion prevention as a stopping game
I We present a Partially Observed Stochastic Game model of the use case
I We present a POMDP model of the defender's problem
I We present an MDP model of the attacker's problem
I We apply optimal stopping theory to establish structural results for the best response
strategies and the existence of Nash equilibria
I We show numerical results in a realistic emulation environment
I We show that our method outperforms two state-of-the-art methods
I Our research plans:
I Extend the model (remove limiting assumptions)
I Fewer restrictions on the defender
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 

Learning Security Strategies through Game Play and Optimal Stopping

  • 1. 1/27 Learning Security Strategies through Game Play and Optimal Stopping ICML '22, Baltimore, 17/7-23/7 2022 Machine Learning for Cyber Security Workshop Kim Hammar & Rolf Stadler kimham@kth.se & stadler@kth.se Division of Network and Systems Engineering KTH Royal Institute of Technology Mar 18, 2022
  • 2. 2/27 Use Case: Intrusion Prevention I A Defender owns an infrastructure I Consists of connected components I Components run network services I Defender defends the infrastructure by monitoring and active defense I Has partial observability I An Attacker seeks to intrude on the infrastructure I Has a partial view of the infrastructure I Wants to compromise specific components I Attacks by reconnaissance, exploitation and pivoting [Figure: infrastructure topology with the Attacker, Clients, the Defender receiving IPS alerts, a Gateway, and numbered components 1-31]
  • 3–11. 3/27 The Intrusion Prevention Problem [Figure: time series over t = 1..200 showing the defender's belief b, the number of IPS alerts, and the number of logins] When to take a defensive action? Which action to take?
  • 12–17. 4/27 A Brief History of Intrusion Prevention [Timeline] Reference points: ARPANET (1969), Internet (1980s). Intrusion prevention milestones: manual detection/prevention (1980s); audit logs and manual detection/prevention (1980s); rule-based IDS/IPS (1990s); rule-based & statistical IDS/IPS (2000s); research on ML-based IDS, RL-based IPS, control-based IPS, and computational game theory for IPS (2010-present)
  • 18–25. 5/27 Our Approach for Finding Effective Security Strategies [Figure: closed loop connecting the Target Infrastructure, an Emulation System, and a Simulation System; labeled steps: Selective Replication, Strategy evaluation & Model estimation, Model Creation & System Identification, Reinforcement Learning & Generalization, Strategy Mapping π, Strategy Implementation π; goal: Automation & Self-learning systems]
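Read as pseudocode, the pipeline above is a closed loop. The Python skeleton below only illustrates that control flow; every function name in it is a placeholder introduced here for illustration, not the authors' implementation.

    # Structural sketch of the emulation/simulation/learning loop (placeholder names).
    def run_in_emulation(strategy):
        """Evaluate a strategy in the emulated infrastructure; return measured traces."""
        return []

    def system_identification(traces):
        """Estimate observation/transition distributions from emulation traces."""
        return {}

    def reinforcement_learning(model):
        """Learn attacker/defender strategies in the simulation system."""
        return lambda belief: "CONTINUE"

    strategy = None
    for iteration in range(5):                    # a few refinement rounds
        traces = run_in_emulation(strategy)       # strategy evaluation & model estimation
        model = system_identification(traces)     # model creation & system identification
        strategy = reinforcement_learning(model)  # learning in the simulation system
    # the final strategy would then be implemented on the target infrastructure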
  • 26–30. 6/27 Outline I Use Case & Approach: I Use case: Intrusion prevention I Approach: Emulation, simulation, and reinforcement learning I Game-Theoretic Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Existence of Nash Equilibria I Structural result: multi-threshold best responses I Our Method for Learning Equilibrium Strategies I Our method for emulating the target infrastructure I Our system identification algorithm I Our reinforcement learning algorithm: T-FP I Results & Conclusion I Numerical evaluation results, conclusion, and future work
  • 31–32. 7/27 The Optimal Stopping Game I Defender: I Has a pre-defined ordered list of defensive measures: 1. Revoke user certificates 2. Blacklist IPs 3. Drop traffic that generates IPS alerts of priority 1−4 4. Block gateway I Defender's strategy decides when to take each action I Attacker: I Has a pre-defined randomized intrusion sequence of reconnaissance and exploit commands: 1. TCP-SYN scan 2. CVE-2017-7494 3. CVE-2015-3306 4. CVE-2015-5602 5. SSH brute-force 6. . . . I Attacker's strategy decides when to start/stop an intrusion [Figure: infrastructure topology with the Attacker, Clients, the Defender with IPS alerts, and the Gateway] We analyze attacker/defender strategies using optimal stopping theory
  • 33–36. 8/27 Optimal Stopping Formulation of Intrusion Prevention [Figure: timeline of a game episode from t = 1 to t = T, showing the attacker's stopping time τ2,1 (intrusion start) and the defender's stopping times τ1,1, τ1,2, τ1,3 until the intrusion is prevented] I The attacker's stopping times τ2,1, τ2,2, . . . determine the times to start/stop the intrusion I During the intrusion, the attacker follows a fixed intrusion strategy I The defender's stopping times τ1,1, τ1,2, . . . determine the times to update the IPS configuration We model this game as a zero-sum partially observed stochastic game
  • 37–43. 9/27 Partially Observed Stochastic Game I Players: N = {1, 2} (Defender = 1) I States: Intrusion st ∈ {0, 1}, terminal ∅. I Observations: number of IPS alerts ot ∈ O and defender stops remaining lt ∈ {1, .., L}; ot is drawn from the r.v. O ∼ fO(·|st) I Actions: A1 = A2 = {S, C} I Rewards: the defender reward balances security and service; the attacker reward is the negative of the defender reward I Transition probabilities: follow from the game dynamics I Horizon: ∞ [State diagram: states 0 (no intrusion), 1 (intrusion), and terminal ∅; state 0 persists while t ≥ 1 and lt > 0; the attacker's stop a(2)t = S moves the game from 0 to 1; state 1 persists while t ≥ 2 and lt > 0; the game terminates when the defender takes its final stop a(1)t = S at lt = 1, when the intrusion is prevented w.p. φlt, or when the attacker stops (a(2)t = S)]
  • 44. 9/27 Partially Observed Stochastic Game (vector observations) Same game as above, except that the observation is a vector of IPS alert metrics ∆x1,t, ∆x2,t, . . . , ∆xM,t together with the defender stops remaining lt ∈ {1, .., L}; the alert vector is drawn from the joint distribution fX(∆x1, . . . , ∆xM | st)
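To make the game model concrete, here is a minimal simulator sketch under simplifying assumptions made for illustration only: Gaussian alert counts stand in for fO(·|st), a constant prevent_prob stands in for the per-stop prevention probability φlt, and the per-step reward is a toy choice rather than the paper's reward function.

    import random

    class StoppingGame:
        """Toy one-sided POSG: states {0, 1} plus terminal, actions {'S', 'C'}, L defender stops."""
        def __init__(self, L=3, prevent_prob=0.5):
            self.L, self.prevent_prob = L, prevent_prob

        def reset(self):
            self.s, self.l, self.done = 0, self.L, False
            return self._obs()

        def _obs(self):
            # stand-in for f_O(.|s): more IPS alerts during an intrusion
            mean = 20 if self.s == 0 else 120
            return max(0.0, random.gauss(mean, 15)), self.l

        def step(self, a_defender, a_attacker):
            # the attacker's stop starts the intrusion (from s=0) or aborts it (from s=1)
            if a_attacker == "S":
                if self.s == 0:
                    self.s = 1
                else:
                    self.done = True              # attacker stops the ongoing intrusion
            if a_defender == "S":
                self.l -= 1
                if self.l == 0 or (self.s == 1 and random.random() < self.prevent_prob):
                    self.done = True              # final stop taken, or intrusion prevented
            # toy reward: penalize ongoing intrusions and defensive stops
            r = -1.0 * (self.s == 1) - 0.5 * (a_defender == "S")
            return self._obs(), r, self.done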
  • 45–46. 10/27 One-Sided Partial Observability I We assume that the attacker has perfect information; only the defender has partial information. [Figure: the attacker observes the state sequence s1, s2, s3, . . . directly, while the defender only observes the observation sequence o1, o2, o3, . . .] I This makes it tractable to compute the defender's belief b(1)t(st) = P[st | ht] (it avoids nested beliefs)
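Because only the defender has partial information, its belief over the two states can be tracked with a standard Bayes filter. The sketch below assumes the attacker's strategy is summarized by a known probability of starting the intrusion (as in a best-response computation) and that fO is available as a density; the dependence on the defender's own action and on lt is omitted for brevity.

    def belief_update(b, o, f_O, p_intrusion_start):
        """One-step defender belief update: returns P[s_{t+1} = 1 | history].

        b: current belief P[s_t = 1]
        o: new observation (e.g., number of IPS alerts)
        f_O: callable f_O(o, s) giving the observation density for s in {0, 1}
        p_intrusion_start: P[attacker starts the intrusion this step | s_t = 0],
            induced by the attacker strategy (assumed known in this sketch).
        """
        # predicted state distribution before seeing o
        pred_1 = b + (1.0 - b) * p_intrusion_start    # intrusion persists or starts
        pred_0 = 1.0 - pred_1
        # Bayes correction with the observation likelihoods
        num_1 = f_O(o, 1) * pred_1
        num_0 = f_O(o, 0) * pred_0
        z = num_0 + num_1
        return num_1 / z if z > 0 else b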
  • 47–48. 11/27 Outline I Use Case & Approach: I Use case: Intrusion prevention I Approach: Emulation, simulation, and reinforcement learning I Game-Theoretic Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Existence of Nash Equilibria I Structural result: multi-threshold best responses I Our Method for Learning Equilibrium Strategies I Our method for emulating the target infrastructure I Our system identification algorithm I Our reinforcement learning algorithm: T-FP I Results & Conclusion I Numerical evaluation results, conclusion, and future work
  • 49–52. 12/27 Game Analysis I Defender strategy is of the form π1,l : B → ∆(A1) I Attacker strategy is of the form π2,l : S × B → ∆(A2) I Defender and attacker objectives: J1(π1,l, π2,l) = E(π1,l,π2,l)[ Σ_{t=1}^{∞} γ^(t−1) Rl(st, at) ] and J2(π1,l, π2,l) = −J1 I Best response correspondences: B1(π2,l) = arg max_{π1,l ∈ Π1} J1(π1,l, π2,l) and B2(π1,l) = arg max_{π2,l ∈ Π2} J2(π1,l, π2,l) I Nash equilibrium (π∗1,l, π∗2,l): π∗1,l ∈ B1(π∗2,l) and π∗2,l ∈ B2(π∗1,l)
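The exploitability metric reported in the evaluation follows directly from these definitions: it adds up how much each player could gain by deviating to a best response. A sketch of that computation, assuming helper routines that estimate each player's best-response value (e.g. by simulation or RL, which is an assumption of this sketch):

    def approximate_exploitability(pi_1, pi_2, br_value_defender, br_value_attacker):
        """delta = max_{pi_1'} J1(pi_1', pi_2) + max_{pi_2'} J2(pi_1, pi_2').

        br_value_defender(pi_2): estimated value of the defender's best response to pi_2
        br_value_attacker(pi_1): estimated value of the attacker's best response to pi_1
        In the zero-sum game, delta >= 0, and delta = 0 exactly at a Nash equilibrium.
        """
        return br_value_defender(pi_2) + br_value_attacker(pi_1)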
  • 53–55. 13/27 Game Analysis Theorem. Given the one-sided POSG Γ with L ≥ 1, the following holds. (A) Γ has a mixed Nash equilibrium. Further, Γ has a pure Nash equilibrium when s = 0 ⇐⇒ b(1) = 0. (B) Given any attacker strategy π2,l ∈ Π2, if fO|s is totally positive of order 2, there exist values α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1] and a defender best response strategy π̃1,l ∈ B1(π2,l) that satisfies: π̃1,l(b(1)) = S ⇐⇒ b(1) ≥ α̃l, for l ∈ 1, . . . , L. (C) Given a defender strategy π1,l ∈ Π1, where π1,l(S|b(1)) is non-decreasing in b(1) and π1,l(S|1) = 1, there exist values β̃0,1, β̃1,1, . . ., β̃0,L, β̃1,L ∈ [0, 1] and a best response strategy π̃2,l ∈ B2(π1,l) of the attacker that satisfies: π̃2,l(0, b(1)) = C ⇐⇒ π1,l(S|b(1)) ≥ β̃0,l and π̃2,l(1, b(1)) = S ⇐⇒ π1,l(S|b(1)) ≥ β̃1,l
  • 56–60. 14/27 Structure of Best Response Strategies [Figure: the defender's belief space b(1) ∈ [0, 1] partitioned by the thresholds α̃1 ≥ α̃2 ≥ . . . ≥ α̃L into stopping sets S(1)1,π2,l, S(1)2,π2,l, . . . , S(1)L,π2,l; similarly, the attacker's stopping sets S(2)0,1,π1,l, . . . , S(2)0,L,π1,l and S(2)1,1,π1,l, . . . , S(2)1,L,π1,l are defined by the thresholds β̃0,1, . . . , β̃0,L and β̃1,1, . . . , β̃1,L]
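The multi-threshold structure translates directly into compact policies. The sketch below implements the best-response forms of Theorem 1; the threshold values are inputs (e.g. learned with the algorithm described later), and the dict-style indexing by l is an assumption of this sketch.

    def defender_threshold_policy(belief, l, alpha):
        """Theorem 1(B): pi~_{1,l}(b) = S  <=>  b >= alpha[l], with alpha[1] >= ... >= alpha[L]."""
        return "S" if belief >= alpha[l] else "C"

    def attacker_threshold_policy(state, belief, l, pi_1_stop_prob, beta_0, beta_1):
        """Theorem 1(C): the attacker reacts to the defender's stop probability pi_{1,l}(S | b)."""
        p_stop = pi_1_stop_prob(belief)      # assumed non-decreasing in the belief
        if state == 0:                        # no ongoing intrusion
            # continue (wait) iff the defender is likely to stop; otherwise start the intrusion
            return "C" if p_stop >= beta_0[l] else "S"
        # during an intrusion: abort iff the defender is likely to stop
        return "S" if p_stop >= beta_1[l] else "C"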
  • 61–62. 15/27 Outline I Use Case & Approach: I Use case: Intrusion prevention I Approach: Emulation, simulation, and reinforcement learning I Game-Theoretic Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Existence of Nash Equilibria I Structural result: multi-threshold best responses I Our Method for Learning Equilibrium Strategies I Our method for emulating the target infrastructure I Our system identification algorithm I Our reinforcement learning algorithm: T-FP I Results & Conclusion I Numerical evaluation results, conclusion, and future work
  • 63. 16/27 Our Method for Learning Effective Security Strategies [Figure: the same pipeline as on page 5/27: Target Infrastructure, Emulation System, Simulation System, Selective Replication, Strategy evaluation & Model estimation, Model Creation & System Identification, Reinforcement Learning & Generalization, Strategy Mapping π, Strategy Implementation π, Automation & Self-learning systems]
  • 64. 17/27 Emulating the Target Infrastructure I Emulate hosts with docker containers I Emulate IPS and vulnerabilities with software I Network isolation and traffic shaping through NetEm in the Linux kernel I Enforce resource constraints using cgroups I Emulate client arrivals with a Poisson process I Internal connections are full-duplex & loss-less with bit capacities of 1000 Mbit/s I External connections are full-duplex with bit capacities of 100 Mbit/s & 0.1% packet loss in normal operation and random bursts of 1% packet loss [Figure: emulated infrastructure topology with the Attacker, Clients, the Defender with IPS alerts, and the Gateway]
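For illustration, the kind of link shaping and resource capping described above can be scripted roughly as follows. This is a minimal sketch, not the authors' configuration: the interface name, container name, image, and limits are assumptions, and the tc/docker invocations only show the general mechanism (NetEm for rate/loss, docker's cgroup flags for CPU/memory).

    import subprocess

    def shape_external_link(interface="eth0"):
        # ~100 Mbit/s with 0.1% packet loss on an emulated external connection (illustrative values)
        subprocess.run(
            ["tc", "qdisc", "add", "dev", interface, "root",
             "netem", "rate", "100mbit", "loss", "0.1%"],
            check=True)

    def start_constrained_container(name="emulated-host", image="ubuntu:20.04"):
        # cgroup-based CPU and memory limits for one emulated host (illustrative values)
        subprocess.run(
            ["docker", "run", "-d", "--name", name,
             "--cpus", "1", "--memory", "512m", image, "sleep", "infinity"],
            check=True)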
  • 65. 18/27 System Identification: Instantiating the Game Model based on Data from the Emulation [Figure: empirical distributions of the number of IPS alerts weighted by priority, ot, conditioned on st = 0 and st = 1, together with the fitted models f̂O(ot|0) and f̂O(ot|1)] I We fit a Gaussian mixture distribution f̂O as an estimate of fO in the target infrastructure I For each state s, we obtain the conditional distribution f̂O|s through expectation-maximization
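A minimal sketch of this identification step, assuming the priority-weighted alert counts have already been exported from the emulation into per-state arrays; the variable names and the number of mixture components are illustrative, not taken from the paper.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_observation_model(alerts_no_intrusion, alerts_intrusion, k=3):
        """Fit Gaussian mixtures by EM; return density estimates f_O(o|s) for s in {0, 1}.

        alerts_no_intrusion, alerts_intrusion: 1-D arrays of alert counts
        collected in the emulation for s = 0 and s = 1 (assumed inputs).
        """
        f0 = GaussianMixture(n_components=k).fit(alerts_no_intrusion.reshape(-1, 1))
        f1 = GaussianMixture(n_components=k).fit(alerts_intrusion.reshape(-1, 1))
        return {
            0: lambda o: np.exp(f0.score_samples(np.array([[o]])))[0],
            1: lambda o: np.exp(f1.score_samples(np.array([[o]])))[0],
        }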
  • 66. 19/27 Our Reinforcement Learning Approach I We learn a Nash equilibrium (π∗1,l,θ(1), π∗2,l,θ(2)) through fictitious self-play. I In each iteration: 1. Learn a best response strategy of the defender by solving a POMDP: π̃1,l,θ(1) ∈ B1(π2,l,θ(2)). 2. Learn a best response strategy of the attacker by solving an MDP: π̃2,l,θ(2) ∈ B2(π1,l,θ(1)). 3. Store the best response strategies in two buffers Θ1, Θ2. 4. Update the strategies to be the average of the stored strategies. [Figure: the self-play process iterates from initial strategies (π1, π2) through intermediate strategies (π′1, π′2) towards the equilibrium strategies (π∗1, π∗2)] (Pseudo-code is available in the paper)
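The fictitious self-play loop can be sketched as follows. The best-response subroutines are placeholders for the threshold-learning algorithm described next, and realizing the average strategy by sampling a past best response uniformly is one simple interpretation of step 4 assumed by this sketch (in practice a stored strategy would be sampled per episode rather than per decision).

    import random

    def fictitious_self_play(best_response_defender, best_response_attacker, n_iterations=100):
        buf_defender, buf_attacker = [], []          # buffers of past best responses
        # arbitrary initial average strategies (always continue)
        avg_defender = lambda belief, l: "C"
        avg_attacker = lambda state, belief, l: "C"
        for _ in range(n_iterations):
            # steps 1-2: best responses against the opponent's current average strategy
            buf_defender.append(best_response_defender(avg_attacker))
            buf_attacker.append(best_response_attacker(avg_defender))
            # steps 3-4: the average strategy mixes uniformly over the stored best responses
            avg_defender = lambda belief, l, b=tuple(buf_defender): random.choice(b)(belief, l)
            avg_attacker = lambda state, belief, l, b=tuple(buf_attacker): random.choice(b)(state, belief, l)
        return avg_defender, avg_attacker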
  • 67–69. 20/27 Our Reinforcement Learning Algorithm for Learning Best-Response Threshold Strategies I We use the structural result that threshold best response strategies exist (Theorem 1) to design an efficient reinforcement learning algorithm for learning best responses. I We seek to learn: the L thresholds of the defender, α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1], and the 2L thresholds of the attacker, β̃0,1, β̃1,1, . . . , β̃0,L, β̃1,L ∈ [0, 1] I We learn these thresholds iteratively through Robbins and Monro's stochastic approximation algorithm. [Figure: loop between Monte-Carlo simulation, which produces the performance estimate Ĵ(θn), and stochastic approximation (RL), which updates θn to θn+1; the iterates start at θ1 and approach θ̂∗] (Herbert Robbins and Sutton Monro. "A Stochastic Approximation Method". In: The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407. doi: 10.1214/aoms/1177729586.)
  • 70–74. 21/27 Our Reinforcement Learning Algorithm: T-FP 1. We learn the thresholds through simulation. 2. For each iteration n ∈ {1, 2, . . .}, we perturb θ(i)n to obtain θ(i)n + cn∆n and θ(i)n − cn∆n. 3. Then, we run two MDP or POMDP episodes. 4. We use the obtained episode outcomes Ĵi(θ(i)n + cn∆n) and Ĵi(θ(i)n − cn∆n) to estimate ∇θ(i) Ji(θ(i)) with the Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimator: [∇̂θ(i)n Ji(θ(i)n)]k = (Ĵi(θ(i)n + cn∆n) − Ĵi(θ(i)n − cn∆n)) / (2cn(∆n)k). 5. Next, we use the estimated gradient and update the vector of thresholds through the stochastic approximation update: θ(i)n+1 = θ(i)n + an ∇̂θ(i)n Ji(θ(i)n). (James C. Spall. "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation". In: IEEE Transactions on Automatic Control 37.3 (1992), pp. 332–341.)
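A minimal SPSA-based threshold-learning sketch follows, under the assumptions that thresholds are parameterized through a sigmoid (so the parameter vector can be unconstrained), that evaluate(thresholds) returns an episode-return estimate Ĵi from the simulator, and that the step-size constants are common SPSA defaults rather than the paper's exact hyperparameters.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))            # maps unconstrained parameters to thresholds in [0, 1]

    def spsa_learn_thresholds(evaluate, dim, n_iterations=500,
                              a=1.0, c=1.0, A=50, alpha=0.602, gamma=0.101, seed=0):
        """evaluate(thresholds) -> estimated episode return (assumed simulator hook)."""
        rng = np.random.default_rng(seed)
        theta = np.zeros(dim)
        for n in range(1, n_iterations + 1):
            a_n = a / (n + A) ** alpha              # decreasing step size a_n
            c_n = c / n ** gamma                    # decreasing perturbation size c_n
            delta = rng.choice([-1.0, 1.0], size=dim)        # Rademacher perturbation Delta_n
            j_plus = evaluate(sigmoid(theta + c_n * delta))
            j_minus = evaluate(sigmoid(theta - c_n * delta))
            grad = (j_plus - j_minus) / (2.0 * c_n * delta)  # SPSA gradient estimate
            theta = theta + a_n * grad              # stochastic approximation (ascent) update
        return sigmoid(theta)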
  • 75–76. 22/27 Outline I Use Case & Approach: I Use case: Intrusion prevention I Approach: Emulation, simulation, and reinforcement learning I Game-Theoretic Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Existence of Nash Equilibria I Structural result: multi-threshold best responses I Our Method for Learning Equilibrium Strategies I Our method for emulating the target infrastructure I Our system identification algorithm I Our reinforcement learning algorithm: T-FP I Results & Conclusion I Numerical evaluation results, conclusion, and future work
  • 77. 23/27 Evaluation Results: Learning Nash Equilibrium Strategies [Figure: learning curves over 100 training iterations showing exploitability, defender reward per episode, and intrusion length for (π1,l, π2,l) in emulation and simulation, the Snort IPS baseline, the ot ≥ 1 baseline, and an upper bound] Learning curves from the self-play process with T-FP; the red curves show simulation results and the blue curves show emulation results; the purple, orange, and black curves relate to baseline strategies; the curves indicate the mean and the 95% confidence interval over four training runs with different random seeds.
  • 78. 24/27 Evaluation Results: Convergence Rates [Figure: exploitability and HSVI approximation error (gap) as a function of running time (min) for T-FP, NFSP, and HSVI] Comparison between T-FP and two baseline algorithms, NFSP and HSVI; the red curve relates to T-FP, the blue curve to NFSP, and the purple curve to HSVI; the left plot shows the approximate exploitability metric and the right plot shows the HSVI approximation error. (Karel Horák, Branislav Bošanský, and Michal Pěchouček. "Heuristic Search Value Iteration for One-Sided Partially Observable Stochastic Games". In: Proceedings of the AAAI Conference on Artificial Intelligence (2017). url: https://ojs.aaai.org/index.php/AAAI/article/view/10597.)
  • 79. 25/27 Evaluation Results: Inspection of Learned Strategies [Figure: π2,l(S|0, π1(S|b(1))), π2,l(S|1, π1(S|b(1))), and π1,l(S|b(1)) plotted against b(1) ∈ B for l = 7, l = 4, and l = 1] Probability of the stop action S by the learned equilibrium strategies as a function of b(1) and l; the left and middle plots show the attacker's stopping probability when s = 0 and s = 1, respectively; the right plot shows the defender's stopping probability.
  • 80. 26/27 Evaluation Results: Inspection of Learned Game Values [Figure: V̂∗7(b(1)) and V̂∗1(b(1)) plotted against b(1) ∈ [0, 1]] The value function V̂∗l(b(1)) computed through the HSVI algorithm with approximation error 4; the blue and red curves relate to l = 7 and l = 1, respectively.
  • 81. 27/27 Conclusions & Future Work I Conclusions: I We develop a method to automatically learn security strategies: (1) emulation system; (2) system identification; (3) simulation system; and (4) reinforcement learning. I We apply the method to an intrusion prevention use case I We formulate intrusion prevention as an optimal stopping game I We present a partially observed stochastic game model of the use case I We present a POMDP model of the defender's problem I We present an MDP model of the attacker's problem I We apply optimal stopping theory to establish structural results for the best response strategies and the existence of Nash equilibria I We show numerical results in a realistic emulation environment I We show that our method outperforms two state-of-the-art methods I Our research plans: I Extend the model (remove limiting assumptions) I Impose fewer restrictions on the defender