Self-Learning Systems for Cyber Security
Kim Hammar & Rolf Stadler
kimham@kth.se, stadler@kth.se
KTH Royal Institute of Technology
CDIS Spring Conference 2021
March 24, 2021
Approach: Game Model & Reinforcement Learning

- Challenges:
  - Evolving & automated attacks
  - Complex infrastructures
- Our Goal:
  - Automate security tasks
  - Adapt to changing attack methods
- Our Approach:
  - Model network attack and defense as games.
  - Use reinforcement learning to learn policies.
  - Incorporate learned policies in self-learning systems.

[Figure: network with an attacker, three clients, a defender, and gateway R1.]
State of the Art

- Game-Learning Programs:
  - TD-Gammon, AlphaGo Zero [1], OpenAI Five, etc.
  - ⇒ Impressive empirical results of RL and self-play
- Attack Simulations:
  - Automated threat modeling [2], automated intrusion detection, etc.
  - ⇒ Need for automation and better security tooling
- Mathematical Modeling:
  - Game theory [3]
  - Markov decision theory
  - ⇒ Many security operations involve strategic decision making

[1] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–. URL: http://dx.doi.org/10.1038/nature24270.
[2] Pontus Johnson, Robert Lagerström, and Mathias Ekstedt. "A Meta Language for Threat Modeling and Attack Simulations". In: Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018). Hamburg, Germany: Association for Computing Machinery, 2018. ISBN: 9781450364485. DOI: 10.1145/3230833.3232799.
[3] Tansu Alpcan and Tamer Basar. Network Security: A Decision and Game-Theoretic Approach. 1st ed. Cambridge University Press, 2010. ISBN: 0521119324.
Our Work

- Use Case: Intrusion Prevention
- Our Method:
  - Emulating computer infrastructures
  - System identification and model creation
  - Reinforcement learning and generalization
- Results: Learning to Capture the Flag
- Conclusions and Future Work
Use Case: Intrusion Prevention

- A Defender owns an infrastructure
  - Consists of connected components
  - Components run network services
  - The defender defends the infrastructure by monitoring and patching
- An Attacker seeks to intrude on the infrastructure
  - Has a partial view of the infrastructure
  - Wants to compromise specific components
  - Attacks by reconnaissance, exploitation, and pivoting

[Figure: network with an attacker, three clients, a defender, and gateway R1.]
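To make the use case concrete, the sketch below shows one way the attacker-defender interaction could be encoded. All names and fields are hypothetical illustrations for this slide, not the model used in our work.

```python
from dataclasses import dataclass, field
from enum import Enum

class AttackerAction(Enum):
    RECONNAISSANCE = 0  # scan a component for services and vulnerabilities
    EXPLOITATION = 1    # exploit a vulnerable service to gain access
    PIVOTING = 2        # use a compromised component to reach new ones

class DefenderAction(Enum):
    MONITOR = 0         # inspect alerts and logs of a component
    PATCH = 1           # patch a vulnerable service

@dataclass
class Component:
    """A connected component in the defender's infrastructure."""
    name: str
    services: list[str]
    compromised: bool = False
    discovered: bool = False  # the attacker has only a partial view

@dataclass
class GameState:
    components: dict[str, Component] = field(default_factory=dict)

    def attacker_view(self) -> list[Component]:
        # The attacker observes only the components it has discovered.
        return [c for c in self.components.values() if c.discovered]
```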
Our Method for Finding Effective Security Strategies

[Figure: method pipeline. A real-world infrastructure is selectively replicated in an emulation system; model creation & system identification yield a simulation system; reinforcement learning & generalization produce a policy π; policy mapping implements π back in the emulation for policy evaluation & model estimation, working toward automation & self-learning systems.]
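Read as pseudocode, the pipeline amounts to a closed loop between the emulation and simulation systems. The sketch below is a minimal illustration of that loop; the callables and their signatures are hypothetical placeholders, not our implementation.

```python
from typing import Any, Callable

def self_learning_loop(
    replicate: Callable[[Any], Any],             # selective replication -> emulation
    collect_traces: Callable[[Any, Any], list],  # run operations, log transitions
    identify_model: Callable[[list], Any],       # system identification -> model
    learn_policy: Callable[[Any], Any],          # reinforcement learning in simulation
    implement_policy: Callable[[Any, Any], None],  # policy mapping pi -> emulation
    target: Any,
    n_rounds: int = 10,
) -> Any:
    """Illustrative closed loop of the pipeline in the figure above."""
    emulation = replicate(target)
    policy = None
    for _ in range(n_rounds):
        traces = collect_traces(emulation, policy)  # policy evaluation data
        model = identify_model(traces)              # model estimation
        policy = learn_policy(model)                # learn pi in the simulation
        implement_policy(emulation, policy)         # map pi back for evaluation
    return policy
```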
Emulation System: Configuration Space Σ

[Figure: emulated infrastructures on subnets 172.18.4.0/24, 172.18.19.0/24, and 172.18.61.0/24, each behind a gateway R1 and instantiated from a configuration σi.]

Emulation: a cluster of machines that runs a virtualized infrastructure which replicates important functionality of target systems.

- The set of virtualized configurations defines a configuration space Σ = ⟨A, O, S, U, T, V⟩.
- A specific emulation is based on a configuration σi ∈ Σ.
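The slide leaves the components of the tuple Σ abstract. Purely as an illustration, a configuration σi could be a record such as the following; every field name and value here is a hypothetical placeholder.

```python
from dataclasses import dataclass

@dataclass
class EmulationConfig:
    """Hypothetical stand-in for a configuration sigma_i in Sigma."""
    subnet: str                                   # e.g. one of the emulated subnets
    topology: tuple[tuple[str, str], ...]         # links between virtual nodes
    services: dict[str, tuple[str, ...]]          # node -> running network services
    vulnerabilities: dict[str, tuple[str, ...]]   # node -> known weaknesses

sigma_1 = EmulationConfig(
    subnet="172.18.4.0/24",
    topology=(("R1", "node1"), ("node1", "node2")),
    services={"node1": ("ssh", "http")},
    vulnerabilities={"node1": ("weak-password",)},
)
```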
Emulation: Execution Times of Replicated Operations

[Figure: histograms of action execution times (costs) in the emulation for |N| = 25, 50, 75, and 100 nodes; normalized frequency (10⁻⁵ to 10⁻²) against time cost (0–2000 s).]

- Fundamental issue: computational methods for policy learning typically require samples on the order of 100k–10M.
- ⇒ It is infeasible to optimize policies directly in the emulation system.
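The infeasibility claim can be checked with back-of-the-envelope arithmetic. In the sketch below, the mean action cost is an assumed illustrative value (the histograms show costs of up to roughly 2000 s), not a measured statistic.

```python
# Rough wall-clock estimate for learning a policy directly in the emulation.
mean_action_cost_s = 10.0  # assumption for illustration only
for n_samples in (100_000, 1_000_000, 10_000_000):
    days = n_samples * mean_action_cost_s / 86_400  # seconds per day
    print(f"{n_samples:>10,} samples -> ~{days:,.0f} days of emulation time")
# ->    100,000 samples -> ~12 days
# ->  1,000,000 samples -> ~116 days
# -> 10,000,000 samples -> ~1,157 days
```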
From Emulation to Simulation: System Identification

[Figure: the emulated network (nodes m1–m7 behind gateway R1) is mapped to an abstract model of per-node feature vectors [m_{i,1}, …, m_{i,k}], which defines a POMDP model ⟨S, A, P, R, γ, O, Z⟩ with actions a₁, a₂, …, states s₁, s₂, …, and observations o₁, o₂, ….]

- Abstract Model Based on Domain Knowledge: models the set of controls, the objective function, and the features of the emulated network.
  - Defines the static parts of a POMDP model.
- Dynamics Model (P, Z) Identified using System Identification: an algorithm based on random walks and maximum-likelihood estimation:

  M(b′ | b, a) ≜ n(b, a, b′) / Σ_{j′} n(b, a, j′)

  where n(b, a, b′) counts the observed transitions from b to b′ under action a.
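A minimal sketch of the estimator above: transition counts n(b, a, b′) logged during random walks in the emulation are normalized into a maximum-likelihood dynamics estimate. The data layout is an assumption made for the example.

```python
from collections import defaultdict

def estimate_dynamics(transitions):
    """Maximum-likelihood estimate M(b' | b, a) = n(b, a, b') / sum_j' n(b, a, j').

    `transitions` is an iterable of (b, a, b_next) tuples, e.g. logged while
    executing random walks in the emulation system.
    """
    counts = defaultdict(lambda: defaultdict(int))  # (b, a) -> {b_next: n}
    for b, a, b_next in transitions:
        counts[(b, a)][b_next] += 1

    M = {}
    for (b, a), successors in counts.items():
        total = sum(successors.values())
        M[(b, a)] = {b_next: n / total for b_next, n in successors.items()}
    return M

# Three logged transitions from belief state 0 under action "scan":
print(estimate_dynamics([(0, "scan", 1), (0, "scan", 1), (0, "scan", 2)]))
# -> {(0, 'scan'): {1: 0.6666666666666666, 2: 0.3333333333333333}}
```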
Policy Optimization in the Simulation System using Reinforcement Learning

- Goal:
  - Approximate π∗ = arg max_π E[ Σ_{t=0}^{T} γ^t r_{t+1} ]
- Learning Algorithm:
  - Represent π by a parameterized policy πθ
  - Define the objective J(θ) = E_{o∼ρ^{πθ}, a∼πθ}[R]
  - Maximize J(θ) by stochastic gradient ascent with gradient
    ∇θ J(θ) = E_{o∼ρ^{πθ}, a∼πθ}[∇θ log πθ(a|o) A^{πθ}(o, a)]
- Domain-Specific Challenges:
  - Partial observability
  - Large state space |S| = (w + 1)^{|N|·m·(m+1)}
  - Large action space |A| = |N|·(m + 1)
  - Non-stationary environment due to the presence of an adversary
  - Generalization
- Finding Effective Security Strategies through Reinforcement Learning and Self-Play [4]

[4] Kim Hammar and Rolf Stadler. "Finding Effective Security Strategies through Reinforcement Learning and Self-Play". In: International Conference on Network and Service Management (CNSM 2020). Izmir, Turkey, Nov. 2020.

[Figure: agent-environment loop: the agent selects action a_t; the environment returns state s_{t+1} and reward r_{t+1}.]
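As a generic illustration of the gradient ascent step above, here is a REINFORCE-style sketch with a tabular softmax policy on a toy problem. It uses the raw return in place of the advantage estimate A^{πθ}(o, a), and the environment is invented for the example; it is not the algorithm or POMDP from our work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_actions = 4, 3
theta = np.zeros((n_obs, n_actions))  # logits of a tabular softmax policy

def policy(o):
    z = np.exp(theta[o] - theta[o].max())
    return z / z.sum()

def episode(T=10):
    """Toy task: reward 1 iff the action index equals the observation mod n_actions."""
    traj = []
    for _ in range(T):
        o = int(rng.integers(n_obs))
        a = int(rng.choice(n_actions, p=policy(o)))
        traj.append((o, a, float(a == o % n_actions)))
    return traj

alpha, gamma = 0.1, 0.99
for _ in range(2000):
    G = 0.0
    for o, a, r in reversed(episode()):
        G = r + gamma * G                 # discounted return from step t
        grad_log = -policy(o)             # grad of log pi(a|o) w.r.t. theta[o]
        grad_log[a] += 1.0
        theta[o] += alpha * G * grad_log  # stochastic gradient ascent step

print(theta.argmax(axis=1))  # typically converges to [0 1 2 0]
```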
Learning Capture-the-Flag Strategies

[Figure: learning curves (train and eval) of our proposed method: average episodic regret over 400 training iterations for the generated simulation, the test emulation environment, and the train emulation environment, together with the lower bound π∗.]

[Figure: evaluation infrastructure: an attacker, three clients, a defender, and gateway R1.]
Learning Capture-the-Flag Strategies

[Figure: episodic rewards, episodic regret, and episodic steps over 400 training iterations for Configurations 1–3, compared against π∗.]

[Figure: the three evaluation configurations, each behind a gateway R1 with an intrusion detection system producing alerts: Configuration 1 on 172.18.4.0/24, Configuration 2 on 172.18.3.0/24, and Configuration 3 on 172.18.2.0/24, with application server, access switch, traffic generator, and flag.]
Conclusions & Future Work

- Conclusions:
  - We develop a method to find effective strategies for intrusion prevention:
    (1) emulation system; (2) system identification; (3) simulation system;
    (4) reinforcement learning; and (5) domain randomization and generalization.
  - We show that self-learning can be successfully applied to network infrastructures.
    - Self-play reinforcement learning in a Markov security game
  - Key challenges: stable convergence, sample efficiency, complexity of emulations,
    large state and action spaces
- Our research plans:
  - Improving the system identification algorithm & generalization
  - Evaluation on real-world infrastructures