1/30
Self-Learning Systems for Cyber Security
NSE Seminar
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
April 9, 2021
4/30
Approach: Game Model & Reinforcement Learning
- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
- Our Approach:
  - Model network attack and defense as games.
  - Use reinforcement learning to learn policies.
  - Incorporate learned policies in self-learning systems.
[Figure: network topology with an attacker, clients 1-3, a defender, and gateway R1.]
5/30
State of the Art
- Game-Learning Programs:
  - TD-Gammon, AlphaGo Zero [1], OpenAI Five, etc.
  - ⇒ Impressive empirical results of RL and self-play
- Attack Simulations:
  - Automated threat modeling [2], automated intrusion detection, etc.
  - ⇒ Need for automation and better security tooling
- Mathematical Modeling:
  - Game theory [3]
  - Markov decision theory
  - ⇒ Many security operations involve strategic decision making

[1] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–. URL: http://dx.doi.org/10.1038/nature24270.
[2] Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. "A Meta Language for Threat Modeling and Attack Simulations". In: Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018). Hamburg, Germany: Association for Computing Machinery, 2018. ISBN: 9781450364485. DOI: 10.1145/3230833.3232799. URL: https://doi.org/10.1145/3230833.3232799.
[3] Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st ed. Cambridge University Press, 2010. ISBN: 0521119324.
6/30
Our Work
- Use Case: Intrusion Prevention
- Our Method:
  - Emulating computer infrastructures
  - System identification and model creation
  - Reinforcement learning and generalization
- Results:
  - Learning to Capture The Flag
  - Learning to Detect Network Intrusions
- Conclusions and Future Work
7/30
Use Case: Intrusion Prevention
- A Defender owns an infrastructure
  - Consists of connected components
  - Components run network services
  - The Defender defends the infrastructure by monitoring and patching
- An Attacker seeks to intrude on the infrastructure
  - Has a partial view of the infrastructure
  - Wants to compromise specific components
  - Attacks by reconnaissance, exploitation, and pivoting
[Figure: network topology with an attacker, clients 1-3, a defender, and gateway R1.]
8/30
Our Method for Finding Effective Security Strategies
[Figure: the method loop. A real-world infrastructure is selectively replicated in an emulation system; model creation & system identification produce a simulation system; reinforcement learning & generalization yield a policy π; the policy is evaluated in the emulation (policy evaluation & model estimation) and mapped back (policy implementation π) toward automation & self-learning systems.]
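The figure describes a closed loop. The sketch below is a minimal Python illustration of that loop; every function here is a hypothetical stand-in (not the authors' implementation) and only shows how selective replication, system identification, reinforcement learning, and policy evaluation feed into each other.

# Minimal sketch of the method loop in the figure above.
import random

def emulate_episodes(policy, n=20):
    # Selective replication: run the emulated infrastructure under the
    # current policy and log (belief, action, next-belief) transitions.
    traces = []
    for _ in range(n):
        b = random.randint(0, 3)
        a = policy(b)
        b_next = random.randint(0, 3)
        traces.append((b, a, b_next))
    return traces

def fit_dynamics(traces):
    # Model creation & system identification: count-based estimate of the
    # dynamics (see the MLE estimator on the system-identification slide).
    counts = {}
    for b, a, b_next in traces:
        counts[(b, a, b_next)] = counts.get((b, a, b_next), 0) + 1
    return counts

def train_policy(model):
    # Reinforcement learning & generalization in the simulation system
    # (stubbed out here as a trivial constant policy).
    return lambda belief: 0

def method_loop(rounds=3):
    policy = train_policy({})
    for _ in range(rounds):
        traces = emulate_episodes(policy)   # emulation -> traces
        model = fit_dynamics(traces)        # traces -> simulation model
        policy = train_policy(model)        # simulation -> learned policy pi
    return policy                           # policy mapped back to the emulation

if __name__ == "__main__":
    method_loop()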
9/30
Emulation System: Configuration Space Σ
[Figure: three emulated infrastructures (172.18.4.0/24, 172.18.19.0/24, 172.18.61.0/24), each behind a gateway R1, instantiated from configurations σ_i.]
Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.
- The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
- A specific emulation is based on a configuration σ_i ∈ Σ.
10/30
Emulation: Execution Times of Replicated Operations
[Figure: histograms (normalized frequency, log scale) of action execution times (costs, in seconds) in the emulation for |N| = 25, 50, 75, and 100 nodes.]
- Fundamental issue: computational methods for policy learning typically require samples on the order of 100k-10M.
- ⇒ Infeasible to optimize in the emulation system.
11/30
From Emulation to Simulation: System Identification
[Figure: pipeline from the emulated network, via an abstract model of per-node feature vectors m_{i,1}, ..., m_{i,k}, to a POMDP model ⟨S, A, P, R, γ, O, Z⟩ with actions a_1, a_2, ..., states s_1, s_2, ..., and observations o_1, o_2, ....]
- Abstract model based on domain knowledge: models the set of controls, the objective function, and the features of the emulated network.
  - Defines the static parts of the POMDP model.
- Dynamics model (P, Z) identified using system identification: an algorithm based on random walks and maximum-likelihood estimation,
  $M(b' \mid b, a) \triangleq \frac{n(b, a, b')}{\sum_{j'} n(b, a, j')}$
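As an illustration of the count-based maximum-likelihood estimate above, a minimal Python sketch is given below; the trace format is an assumption and this is not the authors' implementation.

from collections import Counter, defaultdict

def estimate_dynamics(traces):
    """MLE of M(b' | b, a) = n(b, a, b') / sum_{j'} n(b, a, j').

    `traces` is assumed to be an iterable of (b, a, b_next) tuples collected
    from random walks in the emulation system.
    """
    counts = defaultdict(Counter)              # (b, a) -> Counter over b_next
    for b, a, b_next in traces:
        counts[(b, a)][b_next] += 1
    model = {}
    for (b, a), next_counts in counts.items():
        total = sum(next_counts.values())
        model[(b, a)] = {b_next: n / total for b_next, n in next_counts.items()}
    return model

# Toy usage: three observed transitions from belief 0 under a 'scan' action.
M = estimate_dynamics([(0, "scan", 1), (0, "scan", 1), (0, "scan", 2)])
assert abs(M[(0, "scan")][1] - 2 / 3) < 1e-9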
12/30
System Identification: Estimated Dynamics Model
[Figure: estimated emulation dynamics. Per-node metrics (172.18.4.2, 172.18.4.3, 172.18.4.10, 172.18.4.21, 172.18.4.79): connections, failed logins, accounts, online users, logins, processes. IDS metrics: alerts, alert priorities, severe alerts, warning alerts.]
13/30
System Identification: Estimated Dynamics Model
[Figure: estimated conditional distributions P[· | (b_i, a_i)] for node 172.18.4.2: new TCP/UDP connections, new failed login attempts, created user accounts, new logged-in users, new login events, and created processes.]
14/30
System Identification: Estimated Dynamics Model
[Figure: estimated IDS dynamics P[· | (b_i, a_i)]: IDS alerts, severe IDS alerts, warning IDS alerts, and IDS alert priorities.]
15/30
Policy Optimization in the Simulation System using Reinforcement Learning
- Goal:
  - Approximate $\pi^* = \arg\max_{\pi} \mathbb{E}\big[\sum_{t=0}^{T} \gamma^t r_{t+1}\big]$
- Learning Algorithm:
  - Represent $\pi$ by $\pi_\theta$
  - Define the objective $J(\theta) = \mathbb{E}_{o \sim \rho^{\pi_\theta}, a \sim \pi_\theta}[R]$
  - Maximize $J(\theta)$ by stochastic gradient ascent with gradient
    $\nabla_\theta J(\theta) = \mathbb{E}_{o \sim \rho^{\pi_\theta}, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid o)\, A^{\pi_\theta}(o, a)\big]$
- Domain-Specific Challenges:
  - Partial observability
  - Large state space $|S| = (w + 1)^{|N| \cdot m \cdot (m+1)}$
  - Large action space $|A| = |N| \cdot (m + 1)$
  - Non-stationary environment due to the presence of an adversary
  - Generalization
- See: Finding Effective Security Strategies through Reinforcement Learning and Self-Play [a]
[Figure: the standard agent-environment loop — the agent takes action a_t, the environment returns s_{t+1} and reward r_{t+1}.]

[a] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM 2020). Izmir, Turkey, Nov. 2020.
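The gradient above is the standard score-function (REINFORCE) estimator. The minimal NumPy sketch below illustrates it for a softmax policy over discrete observations and actions; it uses the discounted return in place of the advantage A^{π_θ}(o, a) and is only an illustration, not necessarily the algorithm used in the referenced paper.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """One policy-gradient step; theta[o, a] are logits of pi_theta(a | o).

    `episode` is a list of (observation, action, reward) tuples. The
    discounted return G_t stands in for the advantage A(o, a).
    """
    # Compute discounted returns G_t backwards through the episode.
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    grad = np.zeros_like(theta)
    for (o, a, _), G_t in zip(episode, returns):
        probs = softmax(theta[o])
        d_log = -probs            # d/d(logits) of log pi(a|o) = one_hot(a) - probs
        d_log[a] += 1.0
        grad[o] += d_log * G_t
    return theta + lr * grad      # gradient ascent on J(theta)

# Toy usage: 3 observations, 2 actions, one short episode.
theta = np.zeros((3, 2))
theta = reinforce_update(theta, [(0, 1, 0.0), (1, 1, 0.0), (2, 0, 1.0)])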
17/30
Learning Capture-the-Flag Strategies: Target Infrastructure
- Topology:
  - 32 servers, 1 IDS (Snort), 3 clients
- Services:
  - 1 SNMP, 1 Cassandra, 2 Kafka, 8 HTTP, 1 DNS, 1 SMTP, 2 NTP, 5 IRC, 1 Teamspeak, 1 MongoDB, 1 Samba, 1 RethinkDB, 1 CockroachDB, 2 Postgres, 3 FTP, 15 SSH, 2 FTP
- Vulnerabilities:
  - 2 CVE-2010-0426, 1 CVE-2015-3306, 1 CVE-2015-5602, 1 CVE-2016-10033, 1 CVE-2017-7494, 1 CVE-2014-6271
  - 5 brute-force vulnerabilities
- Operating Systems:
  - 14 Ubuntu-20, 9 Ubuntu-14, 1 Debian 9.2, 2 Debian Wheezy, 5 Debian Jessie, 1 Kali
- Traffic:
  - FTP, SSH, IRC, SNMP, HTTP, Telnet, Postgres, MongoDB, Samba
  - curl, ping, traceroute, nmap, ...
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
18/30
Learning Capture-the-Flag Strategies: System Model 1/3
- A hacker (pentester) has T time-periods to collect flags hidden in the infrastructure.
- The hacker is located at a dedicated starting position N_0 and can connect to a gateway that exposes public-facing services in the infrastructure.
- The hacker has a pre-defined set (cardinality ~200) of network/shell commands available.
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
19/30
Learning Capture-the-Flag Strategies: System Model 2/3
- By executing commands, the hacker collects information:
  - Open ports, failed/successful exploits, vulnerabilities, costs, OS, ...
- Sequences of commands can yield shell access to nodes.
- Given shell access, the hacker can search for flags.
- Associated with each command is a cost c (execution time) and noise n (IDS alerts).
- The objective is to capture all flags with minimal cost within the fixed time horizon T. What strategy achieves this end?
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
Learning Capture-the-Flag Strategies: System Model 2/3
I By execution of commands, the hacker
collects information
I Open ports, failed/successful exploits,
vulnerabilities, costs, OS, ...
I Sequences of commands can yield
shell-access to nodes
I Given shell access, the hacker can search
for flags
I Associated with each command is a cost c
(execution time) and noise n (IDS alerts).
I The objective is to capture all flags with the
minimal cost within the fixed time horizon T.
What strategy achieves this end?
Attacker Client 1 Client 2 Client 3
Defender
R1
Target infrastructure.
20/30
Learning Capture-the-Flag Strategies: System Model 3/3
- Contextual stochastic CTF with partial information:
  - Model the infrastructure as a graph G = ⟨N, E⟩.
  - There are k flags at nodes C ⊆ N.
  - Each node N_i ∈ N has a node state s_i of m attributes.
  - Network state s = {s_A, s_i | i ∈ N} ∈ R^{|N|×m+|N|}.
  - The hacker observes o^A ⊂ s.
  - Action space A = {a^A_1, ..., a^A_k}, where each a^A_i is a command.
  - For every (b, a) ∈ A × S there is a probability w^{A,(x)}_{i,j} of failure and a probability of detection φ(det(s_i) · n^{A,(x)}_{i,j}).
  - State transitions s → s' are decided by a discrete dynamical system s' = F(s, a).
  - The exact dynamics (F, c^A, n^A, w^A, det(·), φ(·)) are unknown to us!
[Figure: graphical model. Nodes N_0, ..., N_7 with state vectors S_i ∈ R^m; each edge (i, j) is annotated with attacker cost, noise, and failure-probability vectors c^A_{i,j} ∈ R^k_+, n^A_{i,j} ∈ R^k_+, w^A_{i,j} ∈ [0, 1]^k; defender parameters c^D ∈ R^l_+, w^D ∈ [0, 1]^l.]
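To make the notation concrete, the sketch below shows one way to represent the annotated graph as a data structure; the attribute names and toy numbers are assumptions for illustration, not the authors' simulator.

from dataclasses import dataclass, field

@dataclass
class Node:
    """A node N_i with an m-dimensional state vector s_i (toy attributes)."""
    state: list                      # e.g. [discovered, shell_access, flags_found]
    has_flag: bool = False

@dataclass
class Edge:
    """Attacker annotations on edge (i, j)."""
    cost: float                      # c^A_{i,j}: expected execution time (s)
    noise: float                     # n^A_{i,j}: expected number of IDS alerts
    p_fail: float                    # w^A_{i,j} in [0, 1]: failure probability

@dataclass
class CTFGraph:
    nodes: dict = field(default_factory=dict)     # i -> Node
    edges: dict = field(default_factory=dict)     # (i, j) -> Edge

# Toy instance: start node 0, one flag at node 2.
g = CTFGraph(
    nodes={0: Node([1, 0, 0]), 1: Node([0, 0, 0]), 2: Node([0, 0, 0], has_flag=True)},
    edges={(0, 1): Edge(cost=4.0, noise=2.0, p_fail=0.3),
           (1, 2): Edge(cost=9.0, noise=5.0, p_fail=0.5)},
)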
21/30
Learning Capture-the-Flag Strategies
[Figure: learning curves (simulation and emulation) of our proposed method — average episodic regret vs. training iteration for the generated simulation, the test emulation environment, and the train emulation environment, compared against the lower bound π*.]
[Figure: evaluation infrastructure — attacker, clients 1-3, defender, and gateway R1.]
22/30
Learning Capture-the-Flag Strategies
[Figure: episodic rewards, episodic regret, and episodic steps vs. training iteration for Configuration 1 (172.18.4.0/24), Configuration 2 (172.18.3.0/24), and Configuration 3 (172.18.2.0/24), compared against π*. Each configuration is a gateway (R1) topology with an application server, intrusion detection system, access switch, flag, and traffic generator.]
23/30
Learning to Detect Network Intrusions: Target Infrastructure
- Topology:
  - 6 servers, 1 IDS (Snort), 3 clients
- Services:
  - 3 SSH, 2 HTTP, 1 DNS, 1 Telnet, 1 FTP, 1 MongoDB, 2 SMTP, 1 Tomcat, 1 Teamspeak3, 1 SNMP, 1 IRC, 1 Postgres, 1 NTP
- Vulnerabilities:
  - 1 CVE-2010-0426, 3 brute-force vulnerabilities
- Operating Systems:
  - 4 Ubuntu-20, 1 Ubuntu-14, 1 Kali
- Traffic:
  - FTP, SSH, IRC, SNMP, HTTP, Telnet, Postgres, MongoDB
  - curl, ping, traceroute, nmap, ...
[Figure: evaluation infrastructure — attacker, clients 1-3, defender, and gateway R1.]
24/30
Learning to Detect Network Intrusions: System Model (1/3)
- An admin should manage the infrastructure for T time-periods.
- The admin can monitor the infrastructure to get a belief about its state, b_t.
- b_1, ..., b_{T-1} can be assumed to be generated from some unknown distribution φ.
- If the admin suspects, based on b_t, that the infrastructure is being intruded upon, he can suspend the suspicious user/traffic.
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
25/30
Learning to Detect Network Intrusions: System Model (2/3)
- Suspending traffic from a true intrusion yields a reward r (salary bonus).
- Not suspending traffic of a true intrusion incurs a cost c (admin is fired).
- Suspending traffic of a falsely suspected intrusion incurs a cost o (breaking the SLA).
- The objective is to decide an optimal response for suspending network traffic. What strategy achieves this end?
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
26/30
Learning to Detect Network Intrusions: System Model (3/3)
- Optimal stopping problem:
  - Action space A = {STOP, CONTINUE}
  - Belief state space B ⊆ R^{8+10·m}
  - A belief state b ∈ B contains relevant metrics for detecting intrusions:
    - IDS alerts, entries in /var/log/auth, logged-in users, TCP connections, processes, ...
  - Reward function R:
    - $r(b_t, \text{STOP}, s_t) = \mathbb{1}_{\text{intrusion}} \cdot \frac{\beta}{t_i}$, where β is a positive constant and t_i is the number of nodes compromised by the attacker
    - ⇒ incentive to detect the intrusion early.
[Figure: target infrastructure — attacker, clients 1-3, defender, and gateway R1.]
27/30
Structural Properties of the Optimal Policy
- Assumptions: there is always an intrusion before T; f(b_t) is the probability of intrusion given b_t; b_t and p are Markov; f(b_t) is non-decreasing in t.
- Claim: the optimal policy is a threshold-based policy.
- Necessary condition for optimality (Bellman):
  $u_t(b_t) = \sup_a \Big[ r_t(b_t, a) + \sum_{b' \in B} p_t(b' \mid b_t, a)\, u_{t+1}(b', a) \Big]$  (1)
  $\qquad\;\; = \max\Big[ f(b_t) \cdot \frac{\beta}{t_i},\; \sum_{b' \in B} \varphi(b')\, u_{t+1}(b') \Big]$  (2)
- Thus, it is optimal to stop at belief state b_t iff
  $f(b_t) \cdot \frac{\beta}{t_i} \geq \sum_{b' \in B} \varphi(b')\, u_{t+1}(b')$  (3)
- Stopping threshold α_t:
  $\alpha_t \triangleq \frac{t_i}{\beta} \sum_{b' \in B} \varphi(b')\, u_{t+1}(b')$  (4)
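Equations (3)-(4) give a simple decision rule. The sketch below illustrates it with assumed values for the belief distribution φ and the continuation values u_{t+1}; it is only an illustration, not derived from the actual POMDP.

def continuation_value(phi, u_next):
    """Expected value of continuing: sum_{b'} phi(b') * u_{t+1}(b')."""
    return sum(phi[b] * u_next[b] for b in phi)

def stopping_threshold(t_i, beta, phi, u_next):
    """alpha_t = (t_i / beta) * sum_{b'} phi(b') * u_{t+1}(b')."""
    return (t_i / beta) * continuation_value(phi, u_next)

def should_stop(f_bt, t_i, beta, phi, u_next):
    """Stop iff f(b_t) >= alpha_t, i.e. f(b_t) * beta / t_i >= continuation value."""
    return f_bt >= stopping_threshold(t_i, beta, phi, u_next)

# Toy example with two successor beliefs (values are assumptions).
phi = {"low": 0.8, "high": 0.2}       # phi(b')
u_next = {"low": 0.1, "high": 1.0}    # u_{t+1}(b')
print(should_stop(f_bt=0.6, t_i=1, beta=1.0, phi=phi, u_next=u_next))   # True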
28/30
Learning to Detect Network Intrusions
[Figure: learning curves (simulation and emulation) of our proposed method — average episodic reward vs. training iteration for π_θ in simulation and emulation, compared with baselines (Snort-severe, Snort-warning, Snort-critical, /var/log/auth) in simulation and emulation, and the upper bound π*.]
[Figure: evaluation infrastructure — attacker, clients 1-3, defender, and gateway R1.]
29/30
Learning to Detect Network Intrusions
[Figure: trade-off between detection and false positives — fractions of true positives (intrusion detected), false positives (early stopping), and false negatives (missed intrusion) vs. training iteration, with the maximum indicated.]
30/30
Conclusions & Future Work
- Conclusions:
  - We develop a method to find effective strategies for intrusion prevention:
    (1) emulation system; (2) system identification; (3) simulation system; (4) reinforcement learning; and (5) domain randomization and generalization.
  - We show that self-learning can be successfully applied to network infrastructures.
    - Self-play reinforcement learning in a Markov security game
    - Key challenges: stable convergence, sample efficiency, complexity of emulations, large state and action spaces
- Our research plans:
  - Improving the system identification algorithm & generalization
  - Evaluation on real-world infrastructures
