CDIS Research Workshop 2021 Balingsholm.
We study automated intrusion prevention using reinforcement learning. In a novel approach, we formulate the problem of intrusion prevention as an optimal multiple stopping problem. This formulation gives us insight into the structure of the optimal policies, which we show to be threshold based. Since computing the optimal defender policy with dynamic programming is not feasible for practical cases, we develop a reinforcement learning approach to approximate the optimal policy for a target infrastructure. The approach uses an emulation of the infrastructure to evaluate policies and to instantiate a simulation model, which is then used to train policies through reinforcement learning. Our results show that the learned policies are close to optimal and that they can indeed be expressed using thresholds.
Learning Intrusion Prevention Policies Through Optimal Stopping
1. 1/16
Learning Intrusion Prevention Policies
Through Optimal Stopping
CDIS Research Workshop
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
Oct 14, 2021
4. 2/16
Approach: Game Model & Reinforcement Learning
▶ Challenges:
  ▶ Evolving & automated attacks
  ▶ Complex infrastructures
▶ Our Goal:
  ▶ Automate security tasks
  ▶ Adapt to changing attack methods
▶ Our Approach:
  ▶ Model network attack and defense as games.
  ▶ Use reinforcement learning to learn policies.
  ▶ Incorporate learned policies in self-learning systems.
[Figure: IT infrastructure; an attacker and clients connect through a gateway, and the defender monitors IDS alerts from 31 networked components N1-N31.]
5. 3/16
Use Case: Intrusion Prevention
▶ A Defender owns an infrastructure
  ▶ Consists of connected components
  ▶ Components run network services
▶ The Defender defends the infrastructure by monitoring and active defense
▶ An Attacker seeks to intrude on the infrastructure
  ▶ Has a partial view of the infrastructure
  ▶ Wants to compromise specific components
  ▶ Attacks by reconnaissance, exploitation, and pivoting
[Figure: IT infrastructure; an attacker and clients connect through a gateway, and the defender monitors IDS alerts from 31 networked components.]
We formulate this use case as an Optimal Stopping problem
7. 5/16
Background on Optimal Stopping Problems
▶ The General Problem:
  ▶ A Markov process $(s_t)_{t=1}^{T}$ is observed sequentially
  ▶ Two options per $t$: (i) continue to observe; or (ii) stop
  ▶ Find the optimal stopping time $\tau^*$:

    $\tau^* = \arg\max_{\tau} \mathbb{E}_{\tau}\left[\sum_{t=1}^{\tau-1} \gamma^{t-1} R^{C}_{s_t s_{t+1}} + \gamma^{\tau-1} R^{S}_{s_\tau s_\tau}\right] \quad (1)$

    where $R^{S}_{ss'}$ and $R^{C}_{ss'}$ are the stop/continue rewards
▶ History:
  ▶ Studied in the 18th century to analyze a gambler's fortune
  ▶ Formalized by Abraham Wald in 1947 [1]
  ▶ Since then it has been generalized and developed by Chow [2], Shiryaev & Kolmogorov [3], Bather [4], Bertsekas [5], etc.
▶ Applications & Use Cases:
  ▶ Change detection [6], machine replacement, hypothesis testing, gambling, selling decisions [7], queue management [8], advertisement scheduling [9], the secretary problem, intrusion prevention [10], etc.

[1] Abraham Wald. Sequential Analysis. Wiley and Sons, New York, 1947.
[2] Y. Chow, H. Robbins, and D. Siegmund. Great Expectations: The Theory of Optimal Stopping. 1971.
[3] Albert N. Shiryaev. Optimal Stopping Rules. Reprint of the 1969 Russian edition. Springer-Verlag, Berlin, 2007.
[4] John Bather. Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions. John Wiley and Sons, 2000. ISBN: 0471976490.
[5] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 3rd ed. Vol. I. Athena Scientific, Belmont, MA, 2005.
[6] Alexander G. Tartakovsky et al. "Detection of intrusions in information systems by sequential change-point methods". Statistical Methodology 3.3 (2006). doi: 10.1016/j.stamet.2005.05.003.
[7] Jacques du Toit and Goran Peskir. "Selling a stock at the ultimate maximum". The Annals of Applied Probability 19.3 (2009). doi: 10.1214/08-AAP566.
[8] Arghyadip Roy et al. "Online Reinforcement Learning of Optimal Threshold Policies for Markov Decision Processes". CoRR (2019). http://arxiv.org/abs/1912.10325.
[9] Vikram Krishnamurthy, Anup Aprem, and Sujay Bhatt. "Multiple stopping time POMDPs: Structural results & application in interactive advertising on social media". Automatica 95 (2018), pp. 385-398. doi: 10.1016/j.automatica.2018.06.013.
[10] Kim Hammar and Rolf Stadler. Learning Intrusion Prevention Policies through Optimal Stopping. 2021. arXiv: 2106.07160 [cs.AI].
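To make objective (1) concrete, the following is a minimal sketch that computes an optimal stopping policy by backward induction (dynamic programming) for a toy, fully observed two-state chain with state-only rewards. The horizon, rewards, and transition probability are illustrative assumptions, not the model used in this work.

```python
import numpy as np

# Minimal finite-horizon optimal stopping sketch in the spirit of Eq. (1),
# assuming a fully observed two-state chain: s=0 (no intrusion), s=1 (intrusion).
# All numbers below are illustrative.
T = 20            # horizon
gamma = 0.99      # discount factor
p = 0.2           # P[intrusion starts at the next step | no intrusion yet]
R_C = np.array([1.0, -2.0])   # continue reward per state
R_S = np.array([-5.0, 10.0])  # stop reward per state (stopping early is costly)
P = np.array([[1 - p, p],     # transition matrix under the continue action
              [0.0, 1.0]])

# Backward induction: V[t, s] is the optimal value at time t in state s.
V = np.zeros((T + 1, 2))
stop = np.zeros((T + 1, 2), dtype=bool)
V[T] = R_S  # must stop at the horizon
for t in range(T - 1, -1, -1):
    continue_value = R_C + gamma * P @ V[t + 1]
    V[t] = np.maximum(R_S, continue_value)
    stop[t] = R_S >= continue_value

# With these rewards, the optimal policy stops exactly when the intrusion state
# is entered, i.e. it is a (trivial) threshold policy on the state.
print("stop in s=0 at t=0..4:", stop[:5, 0])
print("stop in s=1 at t=0..4:", stop[:5, 1])
```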
17. 6/16
Formulating Intrusion Prevention as a Stopping Problem
[Figure: episode timeline from time-step t = 1 to t = T, marking the intrusion event, the interval where the intrusion is ongoing, early stopping times, and stopping times that affect the intrusion; infrastructure diagram as before.]
▶ Intrusion Prevention as an Optimal Stopping Problem:
  ▶ The system evolves in discrete time-steps.
  ▶ The defender observes the infrastructure (IDS, log files, etc.).
  ▶ An intrusion occurs at an unknown time.
  ▶ The defender can make L stops.
  ▶ Each stop is associated with a defensive action.
  ▶ The final stop shuts down the infrastructure.
  ▶ Based on the observations, when is it optimal to stop?
▶ We formalize this problem with a POMDP
18. 7/16
A Partially Observed MDP Model for the Defender
▶ States:
  ▶ Intrusion state $s_t \in \{0, 1\}$; terminal state $\emptyset$.
▶ Observations:
  ▶ Severe/warning IDS alerts $(\Delta x, \Delta y)$, login attempts $\Delta z$, stops remaining $l_t \in \{1, \ldots, L\}$; observation distribution $f_{XYZ}(\Delta x, \Delta y, \Delta z \mid s_t, I_t, t)$
▶ Actions:
  ▶ "Stop" (S) and "Continue" (C)
▶ Rewards:
  ▶ Reward: security and service. Penalty: false alarms and intrusions
▶ Transition probabilities:
  ▶ A Bernoulli process $(Q_t)_{t=1}^{T} \sim \mathrm{Ber}(p)$ defines the intrusion start time $I_t \sim \mathrm{Ge}(p)$
▶ Objective and Horizon:
  ▶ $\max_{\pi_\theta} \mathbb{E}_{\pi_\theta}\big[\sum_{t=1}^{T_\emptyset} r(s_t, a_t)\big]$, horizon $T_\emptyset$
[Figure: state transition diagram over states 0, 1, and ∅, with self-loops for t ≥ 1, l_t > 0 and t ≥ 2, l_t > 0; the intrusion starts when Q_t = 1, is prevented if l_t ≤ l_A, succeeds if t ≥ I_t + T_int, and the episode ends after the final stop (l_t = 0). Plot: CDF of the intrusion start time, I_t ∼ Ge(p = 0.2).]
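A minimal gym-style sketch of such a stopping POMDP is shown below. The Poisson alert rates and reward constants are illustrative assumptions; in our approach, the observation distribution f_XYZ is estimated from the emulation rather than assumed.

```python
import numpy as np

class StoppingPOMDP:
    """Minimal sketch of the defender POMDP: the intrusion starts at a
    geometrically distributed time; observations are noisy alert counts.
    Alert rates and rewards are illustrative, not estimated from data."""

    def __init__(self, p=0.2, L=3, T=50, seed=0):
        self.p, self.L, self.T = p, L, T
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.s, self.l = 1, 0, self.L  # time, intrusion state, stops left
        return self._obs()

    def _obs(self):
        # Alert counts are higher (on average) while an intrusion is ongoing.
        rate = 20.0 if self.s == 1 else 5.0
        dx, dy, dz = self.rng.poisson([rate, rate / 2, rate / 4])
        return np.array([dx, dy, dz, self.l])

    def step(self, a):  # a = 1: stop, a = 0: continue
        r = 0.0
        if a == 1:
            self.l -= 1
            r += 10.0 if self.s == 1 else -10.0  # affect intrusion vs false alarm
        done = self.l == 0 or self.t >= self.T
        if self.s == 1 and not done:
            r -= 1.0  # running cost while the intrusion is ongoing
        if self.s == 0 and self.rng.random() < self.p:
            self.s = 1  # the Bernoulli process Q_t starts the intrusion
        self.t += 1
        return self._obs(), r, done

env = StoppingPOMDP()
obs = env.reset()
obs, r, done = env.step(0)  # continue for one step
```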
We analyze the optimal policy using optimal stopping theory
29. 8/16
Threshold Properties of the Optimal Defender Policy
Let $\tilde{h}_t = \Delta x_t + \Delta y_t + \Delta z_t$, where $\Delta x_t$ = severe IDS alerts, $\Delta y_t$ = warning IDS alerts, and $\Delta z_t$ = login attempts at time $t$. The optimal policy is threshold based:

$\pi^*_l(h_t) = S \iff \tilde{h}_t \geq \beta^*_l, \qquad l \in \{1, \ldots, L\}$

[Figure: the observation axis $\tilde{h}_t \geq 0$ partitioned into stopping sets $\mathscr{S}_1, \mathscr{S}_2, \ldots, \mathscr{S}_L$ with thresholds $\beta^*_1, \beta^*_2, \ldots, \beta^*_L$.]
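A multi-threshold policy of this form takes only a few lines of code; a sketch follows, where the threshold values β*_l are placeholders that would be computed by dynamic programming or learned through reinforcement learning.

```python
import numpy as np

def threshold_policy(obs, thresholds):
    """Multiple-stopping threshold policy: stop (return 1) iff the aggregate
    observation h exceeds the threshold for the current number of stops
    remaining. `thresholds[l]` plays the role of beta*_l."""
    dx, dy, dz, stops_remaining = obs
    h = dx + dy + dz
    return int(h >= thresholds[int(stops_remaining)])

# Illustrative placeholder thresholds beta*_l for l = 1..3 stops remaining.
betas = {1: 60.0, 2: 40.0, 3: 25.0}
print(threshold_policy(np.array([20, 10, 2, 3]), betas))  # h = 32 >= 25 -> stop
```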
30. 9/16
Our Method for Finding Effective Security Strategies
[Figure: method overview. The target infrastructure in the real world is selectively replicated in an emulation system; the emulation is used for policy evaluation, model estimation, model creation & system identification. The estimated model instantiates a simulation system, in which policies π are trained through reinforcement learning & generalization. Learned policies are mapped back to the emulation (policy mapping π) and implemented on the infrastructure (policy implementation π), toward automation & self-learning systems.]
39. 11/16
Emulating the Client Population
Client | Functions | Application servers
1 | HTTP, SSH, SNMP, ICMP | N2, N3, N10, N12
2 | IRC, PostgreSQL, SNMP | N31, N13, N14, N15, N16
3 | FTP, DNS, Telnet | N10, N22, N4
Table 1: Emulated client population; each client interacts with application servers using a set of functions at short intervals.
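As an illustration, a client population of this kind could be emulated along the following lines; the run_function stub and timing parameters are hypothetical, and a real emulation would open actual connections to the servers.

```python
import random
import threading
import time

# Illustrative emulation of the client population in Table 1: each client
# runs in its own thread and repeatedly invokes one of its functions
# against one of its application servers at short, randomized intervals.
CLIENTS = {
    1: (["HTTP", "SSH", "SNMP", "ICMP"], ["N2", "N3", "N10", "N12"]),
    2: (["IRC", "PostgreSQL", "SNMP"], ["N31", "N13", "N14", "N15", "N16"]),
    3: (["FTP", "DNS", "Telnet"], ["N10", "N22", "N4"]),
}

def run_function(function, server):
    # Stub: a real emulation would open a connection to the server here.
    print(f"client invokes {function} on {server}")

def client_loop(functions, servers, stop_event, mean_interval=2.0):
    while not stop_event.is_set():
        run_function(random.choice(functions), random.choice(servers))
        time.sleep(random.expovariate(1.0 / mean_interval))  # short intervals

stop = threading.Event()
threads = [threading.Thread(target=client_loop, args=(f, s, stop), daemon=True)
           for f, s in CLIENTS.values()]
for t in threads:
    t.start()
time.sleep(5)  # let the population run briefly
stop.set()
```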
40. 12/16
Emulating the Defender’s Actions
lt | Action | Command in the Emulation
3 | Reset users | deluser --remove-home <username>
2 | Blacklist IPs | iptables -A INPUT -s <ip> -j DROP
1 | Block gateway | iptables -A INPUT -i eth0 -j DROP
Table 2: Commands used to implement the defender's stop actions in the emulation.
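A sketch of how these stop commands could be dispatched from a defender agent follows; the STOP_COMMANDS mapping and execute_stop helper are hypothetical, and the commands assume root privileges on the emulated hosts.

```python
import subprocess

# Map stops remaining (l_t) to the defensive command from Table 2.
# The <username>/<ip> placeholders must be filled in by the defender agent.
STOP_COMMANDS = {
    3: ["deluser", "--remove-home", "{username}"],
    2: ["iptables", "-A", "INPUT", "-s", "{ip}", "-j", "DROP"],
    1: ["iptables", "-A", "INPUT", "-i", "eth0", "-j", "DROP"],
}

def execute_stop(stops_remaining, **params):
    """Execute the defensive action associated with the current stop."""
    cmd = [part.format(**params) for part in STOP_COMMANDS[stops_remaining]]
    subprocess.run(cmd, check=True)  # assumes root privileges on the host

# e.g. execute_stop(2, ip="172.18.4.2") blacklists a suspected attacker IP.
```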
41. 13/16
Static Attackers to Emulate Intrusions
Time-steps t = 1 .. I_t, with I_t ∼ Ge(0.2): intrusion has not started (all attackers).

NoviceAttacker:
  I_t+1 .. I_t+6: Recon1, brute-force attacks (SSH, Telnet, FTP) on N2, N4, N10, login(N2, N4, N10), backdoor(N2, N4, N10)
  I_t+7 .. I_t+10: Recon1, CVE-2014-6271 on N17, login(N17), backdoor(N17)
  I_t+11 .. I_t+14: SSH brute-force attack on N12, login(N12), CVE-2010-0426 exploit on N12, Recon1

ExperiencedAttacker:
  I_t+1 .. I_t+6: Recon2, CVE-2017-7494 exploit on N4, brute-force attack (SSH) on N2, login(N2, N4), backdoor(N2, N4), Recon2
  I_t+7 .. I_t+10: CVE-2014-6271 on N17, login(N17), backdoor(N17), SSH brute-force attack on N12
  I_t+11 .. I_t+14: login(N12), CVE-2010-0426 exploit on N12, Recon2, SQL Injection on N18
  I_t+15 .. I_t+16: login(N18), backdoor(N18)
  I_t+17 .. I_t+19: Recon2, CVE-2015-1427 on N25, login(N25)

ExpertAttacker:
  I_t+1 .. I_t+6: Recon3, CVE-2017-7494 exploit on N4, login(N4), backdoor(N4), Recon3, SQL Injection on N18
  I_t+7 .. I_t+10: login(N18), backdoor(N18), Recon3, CVE-2015-1427 on N25
  I_t+11 .. I_t+14: login(N25), backdoor(N25), Recon3, CVE-2017-7494 exploit on N27
  I_t+15 .. I_t+16: login(N27), backdoor(N27)

Table 3: Attacker actions to emulate intrusions.
42. 14/16
Learning Intrusion Prevention Policies through Optimal Stopping
[Figure: learning curves of training defender policies against the static attackers (Novice, Experienced, Expert) with L = 3, as a function of # training episodes (0-750K). Metrics per attacker: reward per episode, episode length (steps), P[intrusion interrupted], P[early stopping], and duration of intrusion. Curves: πθ in the emulation, πθ in the simulation, a t = 6 baseline, a (∆x + ∆y) ≥ 1 baseline, and the upper bound π∗.]
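As a toy illustration of how such policies can be trained, the sketch below uses REINFORCE to learn a single soft threshold θ in a one-stop simplification of the POMDP. The environment, rewards, and hyperparameters are illustrative assumptions and do not reproduce the experimental setup above.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rollout(theta, p=0.2, T=30):
    """One episode of a toy single-stop variant of the POMDP. The policy
    stops with probability sigmoid(h - theta), so theta acts as a soft
    threshold on the aggregate alert count h. Numbers are illustrative."""
    s = 0  # intrusion state
    grads, rewards = [], []
    for _ in range(T):
        h = rng.poisson(20.0 if s == 1 else 5.0)  # aggregate alerts
        prob_stop = sigmoid(h - theta)
        a = int(rng.random() < prob_stop)
        grads.append(prob_stop - a)  # d/dtheta log pi_theta(a | h)
        if a == 1:
            rewards.append(10.0 if s == 1 else -10.0)  # detection vs false alarm
            break
        rewards.append(-2.0 if s == 1 else 1.0)  # running cost/benefit
        if s == 0 and rng.random() < p:
            s = 1  # intrusion starts
    return np.array(grads), np.array(rewards)

theta, lr = 0.0, 0.01
for episode in range(5000):
    grads, rewards = rollout(theta)
    returns = np.cumsum(rewards[::-1])[::-1]  # reward-to-go G_t
    theta += lr * np.sum(grads * returns)     # REINFORCE (gradient ascent)
print(f"learned soft threshold: {theta:.1f}")
```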
43. 15/16
Threshold Properties of the Learned Policies, L = 3
[Figure: probability of stopping πθ(S) for lt = 3, 2, 1 stops remaining, together with the belief b(1), as a function of the total alert count x + y (0-200), against the Novice, Experienced, and Expert attackers (panels for z = 0, t = 5 − lt and z = 0, t = 10 − lt). Annotations mark which stops are used at which alert levels, e.g., all stops are used when x + y ≥ 40.]
44. 16/16
Conclusions & Future Work
▶ Conclusions:
  ▶ We develop a method to learn intrusion prevention policies: (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
  ▶ We formulate intrusion prevention as a multiple stopping problem
  ▶ We present a POMDP model of the use case
  ▶ We apply stopping theory to establish structural results for the optimal policy
▶ Our research plans:
  ▶ Extending the theoretical model
  ▶ Relaxing simplifying assumptions (e.g., more dynamic defender actions)
  ▶ Active attacker
  ▶ Evaluation on real-world infrastructures