1/30
Self-Learning Systems for Cyber Security
NSE Seminar
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
April 9, 2021
4/30
Approach: Game Model & Reinforcement Learning
- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
- Our Approach:
  - Model network attack and defense as games.
  - Use reinforcement learning to learn policies.
  - Incorporate learned policies in self-learning systems.
[Figure: network topology with an attacker, clients 1-3, a defender, and gateway R1.]
5/30
State of the Art
- Game-Learning Programs:
  - TD-Gammon, AlphaGo Zero [1], OpenAI Five, etc.
  - ⇒ Impressive empirical results of RL and self-play
- Attack Simulations:
  - Automated threat modeling [2], automated intrusion detection, etc.
  - ⇒ Need for automation and better security tooling
- Mathematical Modeling:
  - Game theory [3]
  - Markov decision theory
  - ⇒ Many security operations involve strategic decision making

[1] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–. URL: http://dx.doi.org/10.1038/nature24270.
[2] Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. "A Meta Language for Threat Modeling and Attack Simulations". In: Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018). Hamburg, Germany: Association for Computing Machinery, 2018. ISBN: 9781450364485. DOI: 10.1145/3230833.3232799. URL: https://doi.org/10.1145/3230833.3232799.
[3] Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st ed. Cambridge University Press, 2010. ISBN: 0521119324.
6/30
Our Work
- Use Case: Intrusion Prevention
- Our Method:
  - Emulating computer infrastructures
  - System identification and model creation
  - Reinforcement learning and generalization
- Results:
  - Learning to Capture The Flag
  - Learning to Detect Network Intrusions
- Conclusions and Future Work
7/30
Use Case: Intrusion Prevention
- A Defender owns an infrastructure
  - Consists of connected components
  - Components run network services
  - The Defender defends the infrastructure by monitoring and patching
- An Attacker seeks to intrude on the infrastructure
  - Has a partial view of the infrastructure
  - Wants to compromise specific components
  - Attacks by reconnaissance, exploitation, and pivoting
[Figure: network topology with an attacker, clients 1-3, a defender, and gateway R1.]
8/30
Our Method for Finding Effective Security Strategies
[Figure: the method loop. A real-world infrastructure is selectively replicated in an emulation system; model creation & system identification produce a simulation system; reinforcement learning & generalization yield a policy π; the policy is evaluated in the emulation (policy evaluation & model estimation) and mapped back (policy implementation π) toward automation & self-learning systems.]
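The figure describes a closed loop. The sketch below is a minimal Python illustration of that loop; every function here is a hypothetical stand-in (not the authors' implementation) and only shows how selective replication, system identification, reinforcement learning, and policy evaluation feed into each other.

# Minimal sketch of the method loop in the figure above.
import random

def emulate_episodes(policy, n=20):
    # Selective replication: run the emulated infrastructure under the
    # current policy and log (belief, action, next-belief) transitions.
    traces = []
    for _ in range(n):
        b = random.randint(0, 3)
        a = policy(b)
        b_next = random.randint(0, 3)
        traces.append((b, a, b_next))
    return traces

def fit_dynamics(traces):
    # Model creation & system identification: count-based estimate of the
    # dynamics (see the MLE estimator on the system-identification slide).
    counts = {}
    for b, a, b_next in traces:
        counts[(b, a, b_next)] = counts.get((b, a, b_next), 0) + 1
    return counts

def train_policy(model):
    # Reinforcement learning & generalization in the simulation system
    # (stubbed out here as a trivial constant policy).
    return lambda belief: 0

def method_loop(rounds=3):
    policy = train_policy({})
    for _ in range(rounds):
        traces = emulate_episodes(policy)   # emulation -> traces
        model = fit_dynamics(traces)        # traces -> simulation model
        policy = train_policy(model)        # simulation -> learned policy pi
    return policy                           # policy mapped back to the emulation

if __name__ == "__main__":
    method_loop()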
9/30
Emulation System: Configuration Space Σ
[Figure: three emulated infrastructures (172.18.4.0/24, 172.18.19.0/24, 172.18.61.0/24), each behind a gateway R1, instantiated from configurations σ_i.]
Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.
- The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
- A specific emulation is based on a configuration σ_i ∈ Σ.
10/30
Emulation: Execution Times of Replicated Operations
[Figure: histograms (normalized frequency, log scale) of action execution times (costs, in seconds) in the emulation for |N| = 25, 50, 75, and 100 nodes.]
- Fundamental issue: computational methods for policy learning typically require samples on the order of 100k-10M.
- ⇒ Infeasible to optimize in the emulation system.
11/30
From Emulation to Simulation: System Identification
[Figure: pipeline from the emulated network, via an abstract model of per-node feature vectors m_{i,1}, ..., m_{i,k}, to a POMDP model ⟨S, A, P, R, γ, O, Z⟩ with actions a_1, a_2, ..., states s_1, s_2, ..., and observations o_1, o_2, ....]
- Abstract model based on domain knowledge: models the set of controls, the objective function, and the features of the emulated network.
  - Defines the static parts of the POMDP model.
- Dynamics model (P, Z) identified using system identification: an algorithm based on random walks and maximum-likelihood estimation,
  $M(b' \mid b, a) \triangleq \frac{n(b, a, b')}{\sum_{j'} n(b, a, j')}$
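As an illustration of the count-based maximum-likelihood estimate above, a minimal Python sketch is given below; the trace format is an assumption and this is not the authors' implementation.

from collections import Counter, defaultdict

def estimate_dynamics(traces):
    """MLE of M(b' | b, a) = n(b, a, b') / sum_{j'} n(b, a, j').

    `traces` is assumed to be an iterable of (b, a, b_next) tuples collected
    from random walks in the emulation system.
    """
    counts = defaultdict(Counter)              # (b, a) -> Counter over b_next
    for b, a, b_next in traces:
        counts[(b, a)][b_next] += 1
    model = {}
    for (b, a), next_counts in counts.items():
        total = sum(next_counts.values())
        model[(b, a)] = {b_next: n / total for b_next, n in next_counts.items()}
    return model

# Toy usage: three observed transitions from belief 0 under a 'scan' action.
M = estimate_dynamics([(0, "scan", 1), (0, "scan", 1), (0, "scan", 2)])
assert abs(M[(0, "scan")][1] - 2 / 3) < 1e-9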
12/30
System Identification: Estimated Dynamics Model
[Figure: estimated emulation dynamics. Per-node metrics (172.18.4.2, 172.18.4.3, 172.18.4.10, 172.18.4.21, 172.18.4.79): connections, failed logins, accounts, online users, logins, processes. IDS metrics: alerts, alert priorities, severe alerts, warning alerts.]
13/30
System Identification: Estimated Dynamics Model
[Figure: estimated conditional distributions P[· | (b_i, a_i)] for node 172.18.4.2: new TCP/UDP connections, new failed login attempts, created user accounts, new logged-in users, new login events, and created processes.]
14/30
System Identification: Estimated Dynamics Model
[Figure: estimated IDS dynamics P[· | (b_i, a_i)]: IDS alerts, severe IDS alerts, warning IDS alerts, and IDS alert priorities.]
15/30
Policy Optimization in the Simulation System using Reinforcement Learning
- Goal:
  - Approximate $\pi^* = \arg\max_{\pi} \mathbb{E}\big[\sum_{t=0}^{T} \gamma^t r_{t+1}\big]$
- Learning Algorithm:
  - Represent $\pi$ by $\pi_\theta$
  - Define the objective $J(\theta) = \mathbb{E}_{o \sim \rho^{\pi_\theta}, a \sim \pi_\theta}[R]$
  - Maximize $J(\theta)$ by stochastic gradient ascent with gradient
    $\nabla_\theta J(\theta) = \mathbb{E}_{o \sim \rho^{\pi_\theta}, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid o)\, A^{\pi_\theta}(o, a)\big]$
- Domain-Specific Challenges:
  - Partial observability
  - Large state space $|S| = (w + 1)^{|N| \cdot m \cdot (m+1)}$
  - Large action space $|A| = |N| \cdot (m + 1)$
  - Non-stationary environment due to the presence of an adversary
  - Generalization
- See: Finding Effective Security Strategies through Reinforcement Learning and Self-Play [a]
[Figure: the standard agent-environment loop — the agent takes action a_t, the environment returns s_{t+1} and reward r_{t+1}.]

[a] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM 2020). Izmir, Turkey, Nov. 2020.
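The gradient above is the standard score-function (REINFORCE) estimator. The minimal NumPy sketch below illustrates it for a softmax policy over discrete observations and actions; it uses the discounted return in place of the advantage A^{π_θ}(o, a) and is only an illustration, not necessarily the algorithm used in the referenced paper.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """One policy-gradient step; theta[o, a] are logits of pi_theta(a | o).

    `episode` is a list of (observation, action, reward) tuples. The
    discounted return G_t stands in for the advantage A(o, a).
    """
    # Compute discounted returns G_t backwards through the episode.
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    grad = np.zeros_like(theta)
    for (o, a, _), G_t in zip(episode, returns):
        probs = softmax(theta[o])
        d_log = -probs            # d/d(logits) of log pi(a|o) = one_hot(a) - probs
        d_log[a] += 1.0
        grad[o] += d_log * G_t
    return theta + lr * grad      # gradient ascent on J(theta)

# Toy usage: 3 observations, 2 actions, one short episode.
theta = np.zeros((3, 2))
theta = reinforce_update(theta, [(0, 1, 0.0), (1, 1, 0.0), (2, 0, 1.0)])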
17/30
Learning Capture-the-Flag Strategies: Target Infrastructure
- Topology:
  - 32 servers, 1 IDS (Snort), 3 clients
- Services:
  - 1 SNMP, 1 Cassandra, 2 Kafka, 8 HTTP, 1 DNS, 1 SMTP, 2 NTP, 5 IRC, 1 Teamspeak, 1 MongoDB, 1 Samba, 1 RethinkDB, 1 CockroachDB, 2 Postgres, 3 FTP, 15 SSH, 2 FTP
- Vulnerabilities:
  - 2 CVE-2010-0426, 1 CVE-2015-3306, 1 CVE-2015-5602, 1 CVE-2016-10033, 1 CVE-2017-7494, 1 CVE-2014-6271
  - 5 brute-force vulnerabilities
- Operating Systems:
  - 14 Ubuntu-20, 9 Ubuntu-14, 1 Debian 9.2, 2 Debian Wheezy, 5 Debian Jessie, 1 Kali
- Traffic:
  - FTP, SSH, IRC, SNMP, HTTP, Telnet, Postgres, MongoDB, Samba
  - curl, ping, traceroute, nmap, ...
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
18/30
Learning Capture-the-Flag Strategies: System Model 1/3
- A hacker (pentester) has T time-periods to collect flags hidden in the infrastructure.
- The hacker is located at a dedicated starting position N_0 and can connect to a gateway that exposes public-facing services in the infrastructure.
- The hacker has a pre-defined set (cardinality ~200) of network/shell commands available.
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
19/30
Learning Capture-the-Flag Strategies: System Model 2/3
- By executing commands, the hacker collects information:
  - Open ports, failed/successful exploits, vulnerabilities, costs, OS, ...
- Sequences of commands can yield shell access to nodes.
- Given shell access, the hacker can search for flags.
- Associated with each command is a cost c (execution time) and noise n (IDS alerts).
- The objective is to capture all flags with minimal cost within the fixed time horizon T. What strategy achieves this end?
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
Learning Capture-the-Flag Strategies: System Model 2/3
I By execution of commands, the hacker
collects information
I Open ports, failed/successful exploits,
vulnerabilities, costs, OS, ...
I Sequences of commands can yield
shell-access to nodes
I Given shell access, the hacker can search
for flags
I Associated with each command is a cost c
(execution time) and noise n (IDS alerts).
I The objective is to capture all flags with the
minimal cost within the fixed time horizon T.
What strategy achieves this end?
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
20/30
Learning Capture-the-Flag Strategies: System Model 3/3
- Contextual stochastic CTF with partial information:
  - Model the infrastructure as a graph G = ⟨N, E⟩.
  - There are k flags at nodes C ⊆ N.
  - Each node N_i ∈ N has a node state s_i of m attributes.
  - Network state s = {s_A, s_i | i ∈ N} ∈ R^{|N|×m+|N|}.
  - The hacker observes o^A ⊂ s.
  - Action space A = {a^A_1, ..., a^A_k}, where each a^A_i is a command.
  - For every (b, a) ∈ A × S there is a probability w^{A,(x)}_{i,j} of failure and a probability of detection φ(det(s_i) · n^{A,(x)}_{i,j}).
  - State transitions s → s' are decided by a discrete dynamical system s' = F(s, a).
  - The exact dynamics (F, c^A, n^A, w^A, det(·), φ(·)) are unknown to us!
[Figure: graphical model. Nodes N_0, ..., N_7 with state vectors S_i ∈ R^m; each edge (i, j) is annotated with attacker cost, noise, and failure-probability vectors c^A_{i,j} ∈ R^k_+, n^A_{i,j} ∈ R^k_+, w^A_{i,j} ∈ [0, 1]^k; defender parameters c^D ∈ R^l_+, w^D ∈ [0, 1]^l.]
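To make the notation concrete, the sketch below shows one way to represent the annotated graph as a data structure; the attribute names and toy numbers are assumptions for illustration, not the authors' simulator.

from dataclasses import dataclass, field

@dataclass
class Node:
    """A node N_i with an m-dimensional state vector s_i (toy attributes)."""
    state: list                      # e.g. [discovered, shell_access, flags_found]
    has_flag: bool = False

@dataclass
class Edge:
    """Attacker annotations on edge (i, j)."""
    cost: float                      # c^A_{i,j}: expected execution time (s)
    noise: float                     # n^A_{i,j}: expected number of IDS alerts
    p_fail: float                    # w^A_{i,j} in [0, 1]: failure probability

@dataclass
class CTFGraph:
    nodes: dict = field(default_factory=dict)     # i -> Node
    edges: dict = field(default_factory=dict)     # (i, j) -> Edge

# Toy instance: start node 0, one flag at node 2.
g = CTFGraph(
    nodes={0: Node([1, 0, 0]), 1: Node([0, 0, 0]), 2: Node([0, 0, 0], has_flag=True)},
    edges={(0, 1): Edge(cost=4.0, noise=2.0, p_fail=0.3),
           (1, 2): Edge(cost=9.0, noise=5.0, p_fail=0.5)},
)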
21/30
Learning Capture-the-Flag Strategies
[Figure: learning curves (simulation and emulation) of our proposed method — average episodic regret vs. training iteration for the generated simulation, the test emulation environment, and the train emulation environment, compared against the lower bound π*.]
[Figure: evaluation infrastructure — attacker, clients 1-3, defender, and gateway R1.]
22/30
Learning Capture-the-Flag Strategies
[Figure: episodic rewards, episodic regret, and episodic steps vs. training iteration for Configuration 1 (172.18.4.0/24), Configuration 2 (172.18.3.0/24), and Configuration 3 (172.18.2.0/24), compared against π*. Each configuration is a gateway (R1) topology with an application server, intrusion detection system, access switch, flag, and traffic generator.]
23/30
Learning to Detect Network Intrusions: Target Infrastructure
- Topology:
  - 6 servers, 1 IDS (Snort), 3 clients
- Services:
  - 3 SSH, 2 HTTP, 1 DNS, 1 Telnet, 1 FTP, 1 MongoDB, 2 SMTP, 1 Tomcat, 1 Teamspeak3, 1 SNMP, 1 IRC, 1 Postgres, 1 NTP
- Vulnerabilities:
  - 1 CVE-2010-0426, 3 brute-force vulnerabilities
- Operating Systems:
  - 4 Ubuntu-20, 1 Ubuntu-14, 1 Kali
- Traffic:
  - FTP, SSH, IRC, SNMP, HTTP, Telnet, Postgres, MongoDB
  - curl, ping, traceroute, nmap, ...
[Figure: evaluation infrastructure — attacker, clients 1-3, defender, and gateway R1.]
24/30
Learning to Detect Network Intrusions: System Model (1/3)
- An admin should manage the infrastructure for T time-periods.
- The admin can monitor the infrastructure to get a belief about its state, b_t.
- b_1, ..., b_{T-1} can be assumed to be generated from some unknown distribution φ.
- If the admin suspects, based on b_t, that the infrastructure is being intruded upon, he can suspend the suspicious user/traffic.
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
25/30
Learning to Detect Network Intrusions: System Model (2/3)
- Suspending traffic from a true intrusion yields a reward r (salary bonus).
- Not suspending traffic of a true intrusion incurs a cost c (admin is fired).
- Suspending traffic of a falsely suspected intrusion incurs a cost o (breaking the SLA).
- The objective is to decide an optimal response for suspending network traffic. What strategy achieves this end?
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
26/30
Learning to Detect Network Intrusions: System Model (3/3)
- Optimal stopping problem:
  - Action space A = {STOP, CONTINUE}
  - Belief state space B ⊆ R^{8+10·m}
  - A belief state b ∈ B contains relevant metrics for detecting intrusions:
    - IDS alerts, entries in /var/log/auth, logged-in users, TCP connections, processes, ...
  - Reward function R:
    - $r(b_t, \text{STOP}, s_t) = \mathbb{1}_{\text{intrusion}} \cdot \frac{\beta}{t_i}$, where β is a positive constant and t_i is the number of nodes compromised by the attacker
    - ⇒ incentive to detect the intrusion early.
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
27/30
Structural Properties of the Optimal Policy
- Assumptions: there is always an intrusion before T; f(b_t) is the probability of intrusion given b_t; b_t and p are Markov; f(b_t) is non-decreasing in t.
- Claim: the optimal policy is a threshold-based policy.
- Necessary condition for optimality (Bellman):
  $u_t(b_t) = \sup_a \Big[ r_t(b_t, a) + \sum_{b' \in B} p_t(b' \mid b_t, a)\, u_{t+1}(b', a) \Big]$  (1)
  $\qquad\;\; = \max\Big[ f(b_t) \cdot \frac{\beta}{t_i},\; \sum_{b' \in B} \varphi(b')\, u_{t+1}(b') \Big]$  (2)
- Thus, it is optimal to stop at belief state b_t iff
  $f(b_t) \cdot \frac{\beta}{t_i} \geq \sum_{b' \in B} \varphi(b')\, u_{t+1}(b')$  (3)
- Stopping threshold α_t:
  $\alpha_t \triangleq \frac{t_i}{\beta} \sum_{b' \in B} \varphi(b')\, u_{t+1}(b')$  (4)
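Equations (3)-(4) give a simple decision rule. The sketch below illustrates it with assumed values for the belief distribution φ and the continuation values u_{t+1}; it is only an illustration, not derived from the actual POMDP.

def continuation_value(phi, u_next):
    """Expected value of continuing: sum_{b'} phi(b') * u_{t+1}(b')."""
    return sum(phi[b] * u_next[b] for b in phi)

def stopping_threshold(t_i, beta, phi, u_next):
    """alpha_t = (t_i / beta) * sum_{b'} phi(b') * u_{t+1}(b')."""
    return (t_i / beta) * continuation_value(phi, u_next)

def should_stop(f_bt, t_i, beta, phi, u_next):
    """Stop iff f(b_t) >= alpha_t, i.e. f(b_t) * beta / t_i >= continuation value."""
    return f_bt >= stopping_threshold(t_i, beta, phi, u_next)

# Toy example with two successor beliefs (values are assumptions).
phi = {"low": 0.8, "high": 0.2}       # phi(b')
u_next = {"low": 0.1, "high": 1.0}    # u_{t+1}(b')
print(should_stop(f_bt=0.6, t_i=1, beta=1.0, phi=phi, u_next=u_next))   # True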
28/30
Learning to Detect Network Intrusions
[Figure: learning curves (simulation and emulation) of our proposed method — average episodic reward vs. training iteration for π_θ in simulation and emulation, compared with baselines (Snort-severe, Snort-warning, Snort-critical, /var/log/auth) in simulation and emulation, and the upper bound π*.]
[Figure: evaluation infrastructure — attacker, clients 1-3, defender, and gateway R1.]
29/30
Learning to Detect Network Intrusions
[Figure: trade-off between detection and false positives — fractions of true positives (intrusion detected), false positives (early stopping), and false negatives (missed intrusion) vs. training iteration, with the maximum indicated.]
30/30
Conclusions & Future Work
- Conclusions:
  - We develop a method to find effective strategies for intrusion prevention:
    (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
  - We show that self-learning can be successfully applied to network infrastructures.
    - Self-play reinforcement learning in a Markov security game
    - Key challenges: stable convergence, sample efficiency, complexity of emulations, large state and action spaces
- Our research plans:
  - Improving the system identification algorithm & generalization
  - Evaluation on real-world infrastructures
