1/27
Learning Security Strategies
through Game Play and Optimal Stopping
ICML 22’, Baltimore, 17/7-23/7 2022
Machine Learning for Cyber Security Workshop
Kim Hammar & Rolf Stadler
kimham@kth.se & stadler@kth.se
Division of Network and Systems Engineering
KTH Royal Institute of Technology
Mar 18, 2022
2/27
Use Case: Intrusion Prevention
I A Defender owns an infrastructure
I Consists of connected components
I Components run network services
I Defender defends the infrastructure
by monitoring and active defense
I Has partial observability
I An Attacker seeks to intrude on the
infrastructure
I Has a partial view of the
infrastructure
I Wants to compromise specific
components
I Attacks by reconnaissance,
exploitation and pivoting
[Figure: the IT infrastructure topology: an attacker and clients connect through a gateway; an IPS sends alerts to the defender; the internal network consists of components numbered 1-31.]
3/27
The Intrusion Prevention Problem
[Figure: example traces over time t (1-200) of the defender's belief b, the number of IPS alerts, and the number of logins.]
When to take a defensive action?
Which action to take?
4/27
A Brief History of Intrusion Prevention
[Timeline figure. Reference points: ARPANET (1969); Internet (1980s). Intrusion prevention milestones: manual detection/prevention (1980s); audit logs and manual detection/prevention (1980s); rule-based IDS/IPS (1990s); rule-based & statistical IDS/IPS (2000s); research on ML-based IDS, RL-based IPS, control-based IPS, and computational game theory for IPS (2010-present).]
5/27
Our Approach for Finding Effective Security Strategies
[Figure: overview of our approach. The target infrastructure is selectively replicated in an emulation system, which provides strategy evaluation & model estimation. Model creation & system identification produce a simulation system, where strategies are obtained through reinforcement learning & generalization. Strategy mapping π carries learned strategies back to the emulation system for strategy implementation, supporting automation & self-learning systems.]
6/27
Outline
I Use Case & Approach:
I Use case: Intrusion prevention
I Approach: Emulation, simulation, and reinforcement learning
I Game-Theoretic Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Existence of Nash Equilibria
I Structural result: multi-threshold best responses
I Our Method for Learning Equilibrium Strategies
I Our method for emulating the target infrastructure
I Our system identification algorithm
I Our reinforcement learning algorithm: T-FP
I Results & Conclusion
I Numerical evaluation results, conclusion, and future work
7/27
The Optimal Stopping Game
I Defender:
I Has a pre-defined ordered list of
defensive measures:
1. Revoke user certificates
2. Blacklist IPs
3. Drop traffic that generates IPS alerts of priority 1-4
4. Block gateway
I Defender’s strategy decides when to
take each action
I Attacker:
I Has a pre-defined randomized
intrusion sequence of reconnaissance
and exploit commands:
1. TCP-SYN scan
2. CVE-2017-7494
3. CVE-2015-3306
4. CVE-2015-5602
5. SSH brute-force
6. . . .
I Attacker's strategy decides when to
start/stop an intrusion
[Figure: the IT infrastructure topology: an attacker and clients connect through a gateway; an IPS sends alerts to the defender; the internal network consists of components numbered 1-31.]
We analyze attacker/defender strategies using optimal stopping theory
8/27
Optimal Stopping Formulation of Intrusion Prevention
[Figure: timeline of a game episode from t = 1 to t = T: the intrusion starts at the attacker's stopping time τ2,1 and is prevented after the defender's stopping times τ1,1, τ1,2, τ1,3.]
I The attacker’s stopping times τ2,1, τ2,2, . . . determine the
times to start/stop the intrusion
I During the intrusion, the attacker follows a fixed intrusion
strategy
I The defender’s stopping times τ1,1, τ1,2, . . . determine the
times to update the IPS configuration
We model this game as a zero-sum partially observed stochastic game
9/27
Partially Observed Stochastic Game
I Players: N = {1, 2} (Defender=1)
I States: Intrusion st ∈ {0, 1}, terminal ∅.
I Observations:
I IPS Alerts ∆x1,t, ∆x2,t, . . . , ∆xM,t and
defender stops remaining lt ∈ {1, . . . , L};
the alerts are distributed as fX(∆x1, . . . , ∆xM|st)
I Actions:
I A1 = A2 = {S, C}
I Rewards:
I Defender reward: security and service.
I Attacker reward: negative of defender
reward.
I Transition probabilities:
I Follows from game dynamics.
I Horizon:
I ∞
[Figure: state transition diagram over the states 0 (no intrusion, t ≥ 1, lt > 0), 1 (intrusion, t ≥ 2, lt > 0), and the terminal state ∅. The game moves from 0 to 1 when the attacker stops (a(2)t = S). It moves to ∅ when lt = 1 and the defender stops (a(1)t = S), when the intrusion is prevented w.p. φlt, or when the attacker stops (a(2)t = S).]
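The transition diagram above can be summarized in code. A minimal sketch of one game transition, assuming illustrative prevention probabilities φl (here phi) and a hypothetical interface; it is a sketch of the dynamics described on this slide, not the paper's implementation.

import random

TERMINAL = None   # the terminal state ∅

def step(s, l, a1, a2, phi):
    """One transition of the stopping game (sketch).
    s in {0, 1} (intrusion state), l = defender stops remaining,
    a1, a2 in {'S', 'C'}, phi[l] = assumed probability that a defensive
    stop prevents an ongoing intrusion when l stops remain."""
    if a1 == 'S':
        if l == 1:                              # last defensive stop: game ends
            return TERMINAL, 0
        if s == 1 and random.random() < phi[l]:
            return TERMINAL, l - 1              # intrusion prevented w.p. phi_l
        l -= 1                                  # otherwise one stop is consumed
    if a2 == 'S':
        if s == 0:
            return 1, l                         # attacker starts the intrusion
        return TERMINAL, l                      # attacker stops the intrusion
    return s, l                                 # both players continue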
10/27
One-Sided Partial Observability
I We assume that the attacker has perfect information. Only
the defender has partial information.
I The attacker's view: the state sequence s1, s2, s3, . . . , sn is observed
I The defender's view: only the observation sequence o1, o2, o3, . . . , on is observed
I Makes it tractable to compute the defender's belief
b(1)t (st) = P[st|ht] (avoids nested beliefs)
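A minimal sketch of how the defender's belief b(1)t(st) = P[st|ht] can be updated recursively; the observation likelihood f_o and the per-step intrusion-start probability p_start are assumptions for illustration, not the paper's model parameters.

def belief_update(b, o, f_o, p_start):
    """One-step Bayes filter for b = P[s_t = 1 | h_t].

    b       : current probability that an intrusion is ongoing
    o       : latest observation (e.g., a weighted IPS alert count)
    f_o     : f_o[s](o) = assumed observation likelihood given state s
    p_start : assumed probability that the attacker starts an intrusion this step
    """
    # Predicted probability of an ongoing intrusion before seeing o
    pred1 = b + (1.0 - b) * p_start
    pred0 = 1.0 - pred1
    # Bayes update with the observation likelihoods
    num = f_o[1](o) * pred1
    den = num + f_o[0](o) * pred0
    return num / den if den > 0 else b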
11/27
Outline
I Use Case & Approach:
I Use case: Intrusion prevention
I Approach: Emulation, simulation, and reinforcement learning
I Game-Theoretic Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Existence of Nash Equilibria
I Structural result: multi-threshold best responses
I Our Method for Learning Equilibrium Strategies
I Our method for emulating the target infrastructure
I Our system identification algorithm
I Our reinforcement learning algorithm: T-FP
I Results & Conclusion
I Numerical evaluation results, conclusion, and future work
12/27
Game Analysis
I Defender strategy is of the form: π1,l : B → ∆(A1)
I Attacker strategy is of the form: π2,l : S × B → ∆(A2)
I Defender and attacker objectives:
J1(π1,l, π2,l) = E(π1,l,π2,l)[ Σ_{t=1}^{∞} γ^(t−1) Rl(st, at) ],   J2(π1,l, π2,l) = −J1(π1,l, π2,l)
I Best response correspondences:
B1(π2,l) = arg max_{π1,l ∈ Π1} J1(π1,l, π2,l),   B2(π1,l) = arg max_{π2,l ∈ Π2} J2(π1,l, π2,l)
I Nash equilibrium (π∗1,l, π∗2,l):
π∗1,l ∈ B1(π∗2,l) and π∗2,l ∈ B2(π∗1,l)
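The objective J1 above can be estimated by Monte-Carlo rollouts in the simulation system. A minimal sketch, assuming hypothetical env, pi1, and pi2 interfaces; it is not the paper's implementation.

def estimate_J1(env, pi1, pi2, gamma=0.99, episodes=100):
    """Monte-Carlo estimate of J1(π1,l, π2,l) = E[ Σ_t γ^(t−1) Rl(st, at) ].
    env, pi1, pi2 are assumed interfaces to the simulated stopping game;
    env.step returns the defender's reward Rl(st, at)."""
    total = 0.0
    for _ in range(episodes):
        s, b, l = env.reset()          # state, defender belief, stops remaining
        discount, G, done = 1.0, 0.0, False
        while not done:
            a1 = pi1(b, l)             # defender acts on its belief
            a2 = pi2(s, b, l)          # attacker also observes the state
            (s, b, l), r, done = env.step(a1, a2)
            G += discount * r          # accumulate discounted defender reward
            discount *= gamma
        total += G
    return total / episodes            # estimate of J1; J2 = -J1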
13/27
Game Analysis
Theorem
Given the one-sided POSG Γ with L ≥ 1, the following holds.
(A) Γ has a mixed Nash equilibrium. Further, Γ has a pure Nash
equilibrium when s = 0 ⇐⇒ b(1) = 0.
(B) Given any attacker strategy π2,l ∈ Π2, if fO|s is totally positive
of order 2, there exist values α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1] and
a defender best response strategy π̃1,l ∈ B1(π2,l ) that satisfies:
π̃1,l(b(1)) = S ⇐⇒ b(1) ≥ α̃l,   l ∈ {1, . . . , L} (1)
(C) Given a defender strategy π1,l ∈ Π1, where π1,l (S|b(1)) is
non-decreasing in b(1) and π1,l (S|1) = 1, there exist values
β̃0,1, β̃1,1, . . ., β̃0,L, β̃1,L ∈ [0, 1] and a best response strategy
π̃2,l ∈ B2(π1,l ) of the attacker that satisfies:
π̃2,l(0, b(1)) = C ⇐⇒ π1,l(S|b(1)) ≥ β̃0,l (2)
π̃2,l(1, b(1)) = S ⇐⇒ π1,l(S|b(1)) ≥ β̃1,l (3)
14/27
Structure of Best Response Strategies
[Figure: the belief space b(1) ∈ [0, 1] partitioned by the best responses. Defender: stopping sets S(1)1,π2,l, S(1)2,π2,l, . . . , S(1)L,π2,l given by the thresholds α̃1 ≥ α̃2 ≥ . . . ≥ α̃L. Attacker: stopping sets S(2)0,1,π1,l, . . . , S(2)0,L,π1,l and S(2)1,1,π1,l, . . . , S(2)1,L,π1,l given by the thresholds β̃0,1, . . . , β̃0,L and β̃1,1, . . . , β̃1,L.]
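A minimal sketch of what the multi-threshold structure of Theorem 1 looks like in code; the threshold values are illustrative placeholders, not the learned equilibrium thresholds from the paper.

# Sketch of the threshold best responses of Theorem 1 (illustrative values).
alpha = [0.9, 0.75, 0.6, 0.5, 0.4, 0.3, 0.2]   # defender thresholds, alpha_1 >= ... >= alpha_L

def defender_br(b, l):
    """Theorem 1(B): pi_1,l(b) = S  iff  b >= alpha_l (l counted from 1)."""
    return 'S' if b >= alpha[l - 1] else 'C'

beta0 = [0.5] * 7   # attacker thresholds on pi_1,l(S|b) when s = 0 (placeholders)
beta1 = [0.5] * 7   # attacker thresholds on pi_1,l(S|b) when s = 1 (placeholders)

def attacker_br(s, pi1_stop_prob, l):
    """Theorem 1(C): compare the defender's stopping probability
    pi_1,l(S|b) against the state-dependent attacker threshold."""
    if s == 0:   # no ongoing intrusion: continue waiting iff the defender is likely to stop
        return 'C' if pi1_stop_prob >= beta0[l - 1] else 'S'
    else:        # ongoing intrusion: stop (abort) iff the defender is likely to stop
        return 'S' if pi1_stop_prob >= beta1[l - 1] else 'C'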
15/27
Outline
I Use Case & Approach:
I Use case: Intrusion prevention
I Approach: Emulation, simulation, and reinforcement learning
I Game-Theoretic Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Existence of Nash Equilibria
I Structural result: multi-threshold best responses
I Our Method for Learning Equilibrium Strategies
I Our method for emulating the target infrastructure
I Our system identification algorithm
I Our reinforcement learning algorithm: T-FP
I Results & Conclusion
I Numerical evaluation results, conclusion, and future work
16/27
Our Method for Learning Effective Security Strategies
[Figure: overview of the method: selective replication of the target infrastructure into an emulation system (strategy evaluation & model estimation); model creation & system identification yield a simulation system (reinforcement learning & generalization); strategy mapping π and strategy implementation π in the emulation system; automation & self-learning systems.]
17/27
Emulating the Target Infrastructure
I Emulate hosts with docker containers
I Emulate IPS and vulnerabilities with
software
I Network isolation and traffic shaping
through NetEm in the Linux kernel
I Enforce resource constraints using
cgroups.
I Emulate client arrivals with a Poisson
process (see the sketch below)
I Internal connections are full-duplex
& lossless with bit capacities of 1000
Mbit/s
I External connections are full-duplex
with bit capacities of 100 Mbit/s,
0.1% packet loss in normal operation,
and random bursts of 1% packet loss
[Figure: the IT infrastructure topology: an attacker and clients connect through a gateway; an IPS sends alerts to the defender; the internal network consists of components numbered 1-31.]
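A minimal sketch of how Poisson client arrivals can be emulated by sampling exponential inter-arrival times; the arrival rate and horizon are illustrative parameters, not the emulation's actual configuration.

import random

def client_arrivals(rate_per_min, horizon_min, seed=None):
    """Sample client arrival times from a Poisson process with the given
    rate (arrivals per minute) over a horizon (minutes); both are assumed
    parameters used only for illustration."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_min)   # exponential inter-arrival time
        if t > horizon_min:
            return arrivals
        arrivals.append(t)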
18/27
System Identification: Instantiating the Game Model based
on Data from the Emulation
[Figure: probability distribution of the number of IPS alerts weighted by priority, ot (range 0-9000); the empirical distributions for st = 0 and st = 1 and the fitted models f̂O(ot|0) and f̂O(ot|1).]
I We fit a Gaussian mixture distribution f̂O as an estimate of fO
in the target infrastructure
I For each state s, we obtain the conditional distribution f̂O|s
through expectation-maximization (see the sketch below)
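A minimal sketch of this identification step, assuming alert counts collected from the emulation for each state and using scikit-learn's GaussianMixture (which runs expectation-maximization internally); the number of mixture components and the array names are assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_observation_model(alerts_no_intrusion, alerts_intrusion, n_components=2):
    """Fit f̂_O(.|s) for s in {0, 1} from emulated alert counts (1-D arrays).
    The component count is an assumption; EM is run by GaussianMixture.fit."""
    models = {}
    for s, data in ((0, alerts_no_intrusion), (1, alerts_intrusion)):
        gm = GaussianMixture(n_components=n_components, random_state=0)
        gm.fit(np.asarray(data, dtype=float).reshape(-1, 1))
        models[s] = gm
    return models

# Usage sketch: log-likelihood of an observation o under each state
# models = fit_observation_model(x0, x1)
# log_lik_s0 = models[0].score_samples(np.array([[o]]))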
19/27
Our Reinforcement Learning Approach
I We learn a Nash equilibrium (π∗1,l,θ(1), π∗2,l,θ(2)) through
fictitious self-play.
I In each iteration (see the sketch below):
1. Learn a best response strategy of the defender by solving a
POMDP: π̃1,l,θ(1) ∈ B1(π2,l,θ(2))
2. Learn a best response strategy of the attacker by solving an
MDP: π̃2,l,θ(2) ∈ B2(π1,l,θ(1))
3. Store the best response strategies in two buffers Θ1, Θ2
4. Update strategies to be the average of the stored strategies
[Figure: the self-play process: the initial strategies π01, π02 are iteratively updated (π1, π2, . . .) until they converge to the equilibrium strategies π∗1, π∗2.]
(Pseudo-code is available in the paper)
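The paper contains the exact pseudo-code; the following is only a simplified, hedged sketch of the self-play loop, where the best-response solvers and the strategy-averaging routine are assumed callables.

def fictitious_self_play(best_response_defender, best_response_attacker,
                         average_strategy, pi1, pi2, iterations=100):
    """Simplified sketch of the self-play loop (not the paper's pseudo-code).
    The three callables are assumptions: a POMDP solver for the defender, an
    MDP solver for the attacker, and a routine that averages the strategies
    stored in the buffers Θ1, Θ2."""
    theta1_buf, theta2_buf = [], []
    for _ in range(iterations):
        theta1_buf.append(best_response_defender(pi2))  # defender vs current attacker
        theta2_buf.append(best_response_attacker(pi1))  # attacker vs current defender
        pi1 = average_strategy(theta1_buf)              # average-strategy update
        pi2 = average_strategy(theta2_buf)
    return pi1, pi2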
20/27
Our Reinforcement Learning Algorithm for Learning
Best-Response Threshold Strategies
I We use the structural result that threshold best response
strategies exist (Theorem 1) to design an efficient
reinforcement learning algorithm to learn best response
strategies.
I We seek to learn:
I L thresholds of the defender, α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1]
I 2L thresholds of the attacker, β̃0,1, β̃1,1, . . . , β̃0,L, β̃1,L ∈ [0, 1]
I We learn these thresholds iteratively through Robbins and
Monro’s stochastic approximation algorithm.1
[Figure: the learning loop: Monte-Carlo simulation produces a performance estimate Ĵ(θn); stochastic approximation (RL) uses it to update θn to θn+1, starting from θ1 and converging to θ̂∗.]
1. Herbert Robbins and Sutton Monro. "A Stochastic Approximation Method". In: The Annals of Mathematical
Statistics 22.3 (1951), pp. 400-407. doi: 10.1214/aoms/1177729586.
21/27
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ(i)n to obtain
θ(i)n + cn∆n and θ(i)n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi(θ(i)n + cn∆n)
and Ĵi(θ(i)n − cn∆n) to estimate ∇θ(i) Ji(θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:
(∇̂θ(i)n Ji(θ(i)n))k = (Ĵi(θ(i)n + cn∆n) − Ĵi(θ(i)n − cn∆n)) / (2cn(∆n)k)
5. Next, we use the estimated gradient to update the vector of
thresholds through the stochastic approximation update (see the
sketch below):
θ(i)n+1 = θ(i)n + an ∇̂θ(i)n Ji(θ(i)n)
4. James C. Spall. "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation". In: IEEE Transactions on Automatic Control 37.3 (1992), pp. 332-341.
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ
(i)
n to obtain
θ
(i)
n + cn∆n and θ
(i)
n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi (θ
(i)
n + cn∆n)
and Ĵi (θ
(i)
n − cn∆n) to estimate ∇θ(i) Ji (θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:

ˆ
∇θ
(i)
n
Ji (θ(i)
n )

k
=
Ĵi (θ
(i)
n + cn∆n) − Ĵi (θ
(i)
n − cn∆n)
2cn(∆n)k
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ
(i)
n+1 = θ(i)
n + an
ˆ
∇θ
(i)
n
Ji (θ(i)
n )
4
James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation”. In: IEEE TRANSACTIONS ON AUTOMATIC CONTROL 37.3 (1992), pp. 332–341.
21/27
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ
(i)
n to obtain
θ
(i)
n + cn∆n and θ
(i)
n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi (θ
(i)
n + cn∆n)
and Ĵi (θ
(i)
n − cn∆n) to estimate ∇θ(i) Ji (θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:

ˆ
∇θ
(i)
n
Ji (θ(i)
n )

k
=
Ĵi (θ
(i)
n + cn∆n) − Ĵi (θ
(i)
n − cn∆n)
2cn(∆n)k
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ
(i)
n+1 = θ(i)
n + an
ˆ
∇θ
(i)
n
Ji (θ(i)
n )
4
James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation”. In: IEEE TRANSACTIONS ON AUTOMATIC CONTROL 37.3 (1992), pp. 332–341.
21/27
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ
(i)
n to obtain
θ
(i)
n + cn∆n and θ
(i)
n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi (θ
(i)
n + cn∆n)
and Ĵi (θ
(i)
n − cn∆n) to estimate ∇θ(i) Ji (θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:

ˆ
∇θ
(i)
n
Ji (θ(i)
n )

k
=
Ĵi (θ
(i)
n + cn∆n) − Ĵi (θ
(i)
n − cn∆n)
2cn(∆n)k
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ
(i)
n+1 = θ(i)
n + an
ˆ
∇θ
(i)
n
Ji (θ(i)
n )
4
James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation”. In: IEEE TRANSACTIONS ON AUTOMATIC CONTROL 37.3 (1992), pp. 332–341.
21/27
Our Reinforcement Learning Algorithm: T-FP
1. We learn the thresholds through simulation.
2. For each iteration n ∈ {1, 2, . . .}, we perturb θ
(i)
n to obtain
θ
(i)
n + cn∆n and θ
(i)
n − cn∆n.
3. Then, we run two MDP or POMDP episodes
4. We then use the obtained episode outcomes Ĵi (θ
(i)
n + cn∆n)
and Ĵi (θ
(i)
n − cn∆n) to estimate ∇θ(i) Ji (θ(i)) using the
Simultaneous Perturbation Stochastic Approximation (SPSA)
gradient estimator4:

ˆ
∇θ
(i)
n
Ji (θ(i)
n )

k
=
Ĵi (θ
(i)
n + cn∆n) − Ĵi (θ
(i)
n − cn∆n)
2cn(∆n)k
5. Next, we use the estimated gradient and update the vector of
thresholds through the stochastic approximation update:
θ
(i)
n+1 = θ(i)
n + an
ˆ
∇θ
(i)
n
Ji (θ(i)
n )
4
James C. Spall. “Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient
Approximation”. In: IEEE TRANSACTIONS ON AUTOMATIC CONTROL 37.3 (1992), pp. 332–341.
22/27
Outline
I Use Case & Approach:
I Use case: Intrusion prevention
I Approach: Emulation, simulation, and reinforcement learning
I Game-Theoretic Model of The Use Case
I Intrusion prevention as an optimal stopping problem
I Partially observed stochastic game
I Game Analysis and Structure of (π̃1, π̃2)
I Existence of Nash Equilibria
I Structural result: multi-threshold best responses
I Our Method for Learning Equilibrium Strategies
I Our method for emulating the target infrastructure
I Our system identification algorithm
I Our reinforcement learning algorithm: T-FP
I Results & Conclusion
I Numerical evaluation results, conclusion, and future work
23/27
Evaluation Results: Learning Nash Equilibrium Strategies
[Figure: exploitability, defender reward per episode, and intrusion length versus the number of training iterations; legend: (π1,l, π2,l) emulation, (π1,l, π2,l) simulation, Snort IPS, ot ≥ 1, upper bound.]
Learning curves from the self-play process with T-FP; the red curves
show simulation results and the blue curves show emulation results; the
purple, orange, and black curves relate to baseline strategies; the curves
indicate the mean and the 95% confidence interval over four training
runs with different random seeds.
24/27
Evaluation Results: Convergence Rates
[Figure: exploitability (left) and approximation error/gap (right) versus running time (min) for T-FP, NFSP, and HSVI.]
Comparison between T-FP and two baseline algorithms: NFSP and
HSVI; the red curve relates to T-FP, the blue curve to NFSP, and the purple
curve to HSVI; the left plot shows the approximate exploitability metric
and the right plot shows the HSVI approximation error5.
5. Karel Horák, Branislav Bošanský, and Michal Pěchouček. "Heuristic Search Value Iteration for One-Sided
Partially Observable Stochastic Games". In: Proceedings of the AAAI Conference on Artificial Intelligence (2017).
url: https://ojs.aaai.org/index.php/AAAI/article/view/10597.
25/27
Evaluation Results: Inspection of Learned Strategies
[Figure: π2,l(S|0, π1(S|b(1))) (left), π2,l(S|1, π1(S|b(1))) (middle), and π1,l(S|b(1)) (right) versus b(1) ∈ B, for l = 7, l = 4, and l = 1.]
Probability of the stop action S under the learned equilibrium strategies as a
function of b(1) and l; the left and middle plots show the attacker's
stopping probability when s = 0 and s = 1, respectively; the right plot
shows the defender's stopping probability.
26/27
Evaluation Results: Inspection of Learned Game Values
[Figure: V̂∗7(b(1)) and V̂∗1(b(1)) plotted over b(1) ∈ [0, 1].]
The value function V̂∗l(b(1)) computed through the HSVI algorithm
with approximation error 4; the blue and red curves relate to l = 7 and
l = 1, respectively.
27/27
Conclusions & Future Work
I Conclusions:
I We develop a method to automatically learn security strategies:
I (1) emulation system; (2) system identification; (3) simulation system; and (4)
reinforcement learning
I We apply the method to an intrusion prevention use case:
I We formulate intrusion prevention as a stopping game
I We present a Partially Observed Stochastic Game model of the use case
I We present a POMDP model of the defender's problem
I We present an MDP model of the attacker's problem
I We apply optimal stopping theory to establish structural results for the best response
strategies and the existence of Nash equilibria
I We show numerical results in a realistic emulation environment
I We show that our method outperforms two state-of-the-art methods
I Our research plans:
I Extend the model (remove limiting assumptions)
I Fewer restrictions on the defender
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 

Learning Security Strategies through Game Play and Optimal Stopping

  • 1. 1/27 Learning Security Strategies through Game Play and Optimal Stopping ICML '22, Baltimore, 17/7-23/7 2022 Machine Learning for Cyber Security Workshop Kim Hammar & Rolf Stadler kimham@kth.se & stadler@kth.se Division of Network and Systems Engineering KTH Royal Institute of Technology Mar 18, 2022
  • 2. 2/27 Use Case: Intrusion Prevention I A Defender owns an infrastructure I Consists of connected components I Components run network services I Defender defends the infrastructure by monitoring and active defense I Has partial observability I An Attacker seeks to intrude on the infrastructure I Has a partial view of the infrastructure I Wants to compromise specific components I Attacks by reconnaissance, exploitation and pivoting [Figure: infrastructure topology with the Attacker, Clients, the Defender receiving IPS alerts, a Gateway, and numbered components 1-31]
  • 3–11. 3/27 The Intrusion Prevention Problem [Figure: time series over t = 1..200 showing the defender's belief b, the number of IPS alerts, and the number of logins] When to take a defensive action? Which action to take?
  • 12–17. 4/27 A Brief History of Intrusion Prevention [Timeline] Reference points: ARPANET (1969), Internet (1980s). Intrusion prevention milestones: manual detection/prevention (1980s); audit logs and manual detection/prevention (1980s); rule-based IDS/IPS (1990s); rule-based & statistical IDS/IPS (2000s); research on ML-based IDS, RL-based IPS, control-based IPS, and computational game theory for IPS (2010-present)
  • 18–25. 5/27 Our Approach for Finding Effective Security Strategies [Figure: closed loop connecting the Target Infrastructure, an Emulation System, and a Simulation System; labeled steps: Selective Replication, Strategy evaluation & Model estimation, Model Creation & System Identification, Reinforcement Learning & Generalization, Strategy Mapping π, Strategy Implementation π; goal: Automation & Self-learning systems]
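Read as pseudocode, the pipeline above is a closed loop. The Python skeleton below only illustrates that control flow; every function name in it is a placeholder introduced here for illustration, not the authors' implementation.

    # Structural sketch of the emulation/simulation/learning loop (placeholder names).
    def run_in_emulation(strategy):
        """Evaluate a strategy in the emulated infrastructure; return measured traces."""
        return []

    def system_identification(traces):
        """Estimate observation/transition distributions from emulation traces."""
        return {}

    def reinforcement_learning(model):
        """Learn attacker/defender strategies in the simulation system."""
        return lambda belief: "CONTINUE"

    strategy = None
    for iteration in range(5):                    # a few refinement rounds
        traces = run_in_emulation(strategy)       # strategy evaluation & model estimation
        model = system_identification(traces)     # model creation & system identification
        strategy = reinforcement_learning(model)  # learning in the simulation system
    # the final strategy would then be implemented on the target infrastructure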
  • 26–30. 6/27 Outline I Use Case & Approach: I Use case: Intrusion prevention I Approach: Emulation, simulation, and reinforcement learning I Game-Theoretic Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Existence of Nash Equilibria I Structural result: multi-threshold best responses I Our Method for Learning Equilibrium Strategies I Our method for emulating the target infrastructure I Our system identification algorithm I Our reinforcement learning algorithm: T-FP I Results & Conclusion I Numerical evaluation results, conclusion, and future work
  • 31–32. 7/27 The Optimal Stopping Game I Defender: I Has a pre-defined ordered list of defensive measures: 1. Revoke user certificates 2. Blacklist IPs 3. Drop traffic that generates IPS alerts of priority 1−4 4. Block gateway I Defender's strategy decides when to take each action I Attacker: I Has a pre-defined randomized intrusion sequence of reconnaissance and exploit commands: 1. TCP-SYN scan 2. CVE-2017-7494 3. CVE-2015-3306 4. CVE-2015-5602 5. SSH brute-force 6. . . . I Attacker's strategy decides when to start/stop an intrusion [Figure: infrastructure topology with the Attacker, Clients, the Defender with IPS alerts, and the Gateway] We analyze attacker/defender strategies using optimal stopping theory
  • 33–36. 8/27 Optimal Stopping Formulation of Intrusion Prevention [Figure: timeline of a game episode from t = 1 to t = T, showing the attacker's stopping time τ2,1 (intrusion start) and the defender's stopping times τ1,1, τ1,2, τ1,3 until the intrusion is prevented] I The attacker's stopping times τ2,1, τ2,2, . . . determine the times to start/stop the intrusion I During the intrusion, the attacker follows a fixed intrusion strategy I The defender's stopping times τ1,1, τ1,2, . . . determine the times to update the IPS configuration We model this game as a zero-sum partially observed stochastic game
  • 37–43. 9/27 Partially Observed Stochastic Game I Players: N = {1, 2} (Defender = 1) I States: Intrusion st ∈ {0, 1}, terminal ∅. I Observations: number of IPS alerts ot ∈ O and defender stops remaining lt ∈ {1, .., L}; ot is drawn from the r.v. O ∼ fO(·|st) I Actions: A1 = A2 = {S, C} I Rewards: the defender reward balances security and service; the attacker reward is the negative of the defender reward I Transition probabilities: follow from the game dynamics I Horizon: ∞ [State diagram: states 0 (no intrusion), 1 (intrusion), and terminal ∅; state 0 persists while t ≥ 1 and lt > 0; the attacker's stop a(2)t = S moves the game from 0 to 1; state 1 persists while t ≥ 2 and lt > 0; the game terminates when the defender takes its final stop a(1)t = S at lt = 1, when the intrusion is prevented w.p. φlt, or when the attacker stops (a(2)t = S)]
  • 44. 9/27 Partially Observed Stochastic Game (vector observations) Same game as above, except that the observation is a vector of IPS alert metrics ∆x1,t, ∆x2,t, . . . , ∆xM,t together with the defender stops remaining lt ∈ {1, .., L}; the alert vector is drawn from the joint distribution fX(∆x1, . . . , ∆xM | st)
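To make the game model concrete, here is a minimal simulator sketch under simplifying assumptions made for illustration only: Gaussian alert counts stand in for fO(·|st), a constant prevent_prob stands in for the per-stop prevention probability φlt, and the per-step reward is a toy choice rather than the paper's reward function.

    import random

    class StoppingGame:
        """Toy one-sided POSG: states {0, 1} plus terminal, actions {'S', 'C'}, L defender stops."""
        def __init__(self, L=3, prevent_prob=0.5):
            self.L, self.prevent_prob = L, prevent_prob

        def reset(self):
            self.s, self.l, self.done = 0, self.L, False
            return self._obs()

        def _obs(self):
            # stand-in for f_O(.|s): more IPS alerts during an intrusion
            mean = 20 if self.s == 0 else 120
            return max(0.0, random.gauss(mean, 15)), self.l

        def step(self, a_defender, a_attacker):
            # the attacker's stop starts the intrusion (from s=0) or aborts it (from s=1)
            if a_attacker == "S":
                if self.s == 0:
                    self.s = 1
                else:
                    self.done = True              # attacker stops the ongoing intrusion
            if a_defender == "S":
                self.l -= 1
                if self.l == 0 or (self.s == 1 and random.random() < self.prevent_prob):
                    self.done = True              # final stop taken, or intrusion prevented
            # toy reward: penalize ongoing intrusions and defensive stops
            r = -1.0 * (self.s == 1) - 0.5 * (a_defender == "S")
            return self._obs(), r, self.done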
  • 45–46. 10/27 One-Sided Partial Observability I We assume that the attacker has perfect information; only the defender has partial information. [Figure: the attacker observes the state sequence s1, s2, s3, . . . directly, while the defender only observes the observation sequence o1, o2, o3, . . .] I This makes it tractable to compute the defender's belief b(1)t(st) = P[st | ht] (it avoids nested beliefs)
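Because only the defender has partial information, its belief over the two states can be tracked with a standard Bayes filter. The sketch below assumes the attacker's strategy is summarized by a known probability of starting the intrusion (as in a best-response computation) and that fO is available as a density; the dependence on the defender's own action and on lt is omitted for brevity.

    def belief_update(b, o, f_O, p_intrusion_start):
        """One-step defender belief update: returns P[s_{t+1} = 1 | history].

        b: current belief P[s_t = 1]
        o: new observation (e.g., number of IPS alerts)
        f_O: callable f_O(o, s) giving the observation density for s in {0, 1}
        p_intrusion_start: P[attacker starts the intrusion this step | s_t = 0],
            induced by the attacker strategy (assumed known in this sketch).
        """
        # predicted state distribution before seeing o
        pred_1 = b + (1.0 - b) * p_intrusion_start    # intrusion persists or starts
        pred_0 = 1.0 - pred_1
        # Bayes correction with the observation likelihoods
        num_1 = f_O(o, 1) * pred_1
        num_0 = f_O(o, 0) * pred_0
        z = num_0 + num_1
        return num_1 / z if z > 0 else b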
  • 47–48. 11/27 Outline I Use Case & Approach: I Use case: Intrusion prevention I Approach: Emulation, simulation, and reinforcement learning I Game-Theoretic Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Existence of Nash Equilibria I Structural result: multi-threshold best responses I Our Method for Learning Equilibrium Strategies I Our method for emulating the target infrastructure I Our system identification algorithm I Our reinforcement learning algorithm: T-FP I Results & Conclusion I Numerical evaluation results, conclusion, and future work
  • 49–52. 12/27 Game Analysis I Defender strategy is of the form π1,l : B → ∆(A1) I Attacker strategy is of the form π2,l : S × B → ∆(A2) I Defender and attacker objectives: J1(π1,l, π2,l) = E(π1,l,π2,l)[ Σ_{t=1}^{∞} γ^(t−1) Rl(st, at) ] and J2(π1,l, π2,l) = −J1 I Best response correspondences: B1(π2,l) = arg max_{π1,l ∈ Π1} J1(π1,l, π2,l) and B2(π1,l) = arg max_{π2,l ∈ Π2} J2(π1,l, π2,l) I Nash equilibrium (π∗1,l, π∗2,l): π∗1,l ∈ B1(π∗2,l) and π∗2,l ∈ B2(π∗1,l)
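The exploitability metric reported in the evaluation follows directly from these definitions: it adds up how much each player could gain by deviating to a best response. A sketch of that computation, assuming helper routines that estimate each player's best-response value (e.g. by simulation or RL, which is an assumption of this sketch):

    def approximate_exploitability(pi_1, pi_2, br_value_defender, br_value_attacker):
        """delta = max_{pi_1'} J1(pi_1', pi_2) + max_{pi_2'} J2(pi_1, pi_2').

        br_value_defender(pi_2): estimated value of the defender's best response to pi_2
        br_value_attacker(pi_1): estimated value of the attacker's best response to pi_1
        In the zero-sum game, delta >= 0, and delta = 0 exactly at a Nash equilibrium.
        """
        return br_value_defender(pi_2) + br_value_attacker(pi_1)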
  • 53–55. 13/27 Game Analysis Theorem. Given the one-sided POSG Γ with L ≥ 1, the following holds. (A) Γ has a mixed Nash equilibrium. Further, Γ has a pure Nash equilibrium when s = 0 ⇐⇒ b(1) = 0. (B) Given any attacker strategy π2,l ∈ Π2, if fO|s is totally positive of order 2, there exist values α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1] and a defender best response strategy π̃1,l ∈ B1(π2,l) that satisfies: π̃1,l(b(1)) = S ⇐⇒ b(1) ≥ α̃l, for l ∈ 1, . . . , L. (C) Given a defender strategy π1,l ∈ Π1, where π1,l(S|b(1)) is non-decreasing in b(1) and π1,l(S|1) = 1, there exist values β̃0,1, β̃1,1, . . ., β̃0,L, β̃1,L ∈ [0, 1] and a best response strategy π̃2,l ∈ B2(π1,l) of the attacker that satisfies: π̃2,l(0, b(1)) = C ⇐⇒ π1,l(S|b(1)) ≥ β̃0,l and π̃2,l(1, b(1)) = S ⇐⇒ π1,l(S|b(1)) ≥ β̃1,l
  • 56–60. 14/27 Structure of Best Response Strategies [Figure: the defender's belief space b(1) ∈ [0, 1] partitioned by the thresholds α̃1 ≥ α̃2 ≥ . . . ≥ α̃L into stopping sets S(1)1,π2,l, S(1)2,π2,l, . . . , S(1)L,π2,l; similarly, the attacker's stopping sets S(2)0,1,π1,l, . . . , S(2)0,L,π1,l and S(2)1,1,π1,l, . . . , S(2)1,L,π1,l are defined by the thresholds β̃0,1, . . . , β̃0,L and β̃1,1, . . . , β̃1,L]
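The multi-threshold structure translates directly into compact policies. The sketch below implements the best-response forms of Theorem 1; the threshold values are inputs (e.g. learned with the algorithm described later), and the dict-style indexing by l is an assumption of this sketch.

    def defender_threshold_policy(belief, l, alpha):
        """Theorem 1(B): pi~_{1,l}(b) = S  <=>  b >= alpha[l], with alpha[1] >= ... >= alpha[L]."""
        return "S" if belief >= alpha[l] else "C"

    def attacker_threshold_policy(state, belief, l, pi_1_stop_prob, beta_0, beta_1):
        """Theorem 1(C): the attacker reacts to the defender's stop probability pi_{1,l}(S | b)."""
        p_stop = pi_1_stop_prob(belief)      # assumed non-decreasing in the belief
        if state == 0:                        # no ongoing intrusion
            # continue (wait) iff the defender is likely to stop; otherwise start the intrusion
            return "C" if p_stop >= beta_0[l] else "S"
        # during an intrusion: abort iff the defender is likely to stop
        return "S" if p_stop >= beta_1[l] else "C"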
  • 61–62. 15/27 Outline I Use Case & Approach: I Use case: Intrusion prevention I Approach: Emulation, simulation, and reinforcement learning I Game-Theoretic Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Existence of Nash Equilibria I Structural result: multi-threshold best responses I Our Method for Learning Equilibrium Strategies I Our method for emulating the target infrastructure I Our system identification algorithm I Our reinforcement learning algorithm: T-FP I Results & Conclusion I Numerical evaluation results, conclusion, and future work
  • 63. 16/27 Our Method for Learning Effective Security Strategies [Figure: the same pipeline as on page 5/27: Target Infrastructure, Emulation System, Simulation System, Selective Replication, Strategy evaluation & Model estimation, Model Creation & System Identification, Reinforcement Learning & Generalization, Strategy Mapping π, Strategy Implementation π, Automation & Self-learning systems]
  • 64. 17/27 Emulating the Target Infrastructure I Emulate hosts with docker containers I Emulate IPS and vulnerabilities with software I Network isolation and traffic shaping through NetEm in the Linux kernel I Enforce resource constraints using cgroups I Emulate client arrivals with a Poisson process I Internal connections are full-duplex & loss-less with bit capacities of 1000 Mbit/s I External connections are full-duplex with bit capacities of 100 Mbit/s & 0.1% packet loss in normal operation and random bursts of 1% packet loss [Figure: emulated infrastructure topology with the Attacker, Clients, the Defender with IPS alerts, and the Gateway]
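For illustration, the kind of link shaping and resource capping described above can be scripted roughly as follows. This is a minimal sketch, not the authors' configuration: the interface name, container name, image, and limits are assumptions, and the tc/docker invocations only show the general mechanism (NetEm for rate/loss, docker's cgroup flags for CPU/memory).

    import subprocess

    def shape_external_link(interface="eth0"):
        # ~100 Mbit/s with 0.1% packet loss on an emulated external connection (illustrative values)
        subprocess.run(
            ["tc", "qdisc", "add", "dev", interface, "root",
             "netem", "rate", "100mbit", "loss", "0.1%"],
            check=True)

    def start_constrained_container(name="emulated-host", image="ubuntu:20.04"):
        # cgroup-based CPU and memory limits for one emulated host (illustrative values)
        subprocess.run(
            ["docker", "run", "-d", "--name", name,
             "--cpus", "1", "--memory", "512m", image, "sleep", "infinity"],
            check=True)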
  • 65. 18/27 System Identification: Instantiating the Game Model based on Data from the Emulation [Figure: empirical distributions of the number of IPS alerts weighted by priority, ot, conditioned on st = 0 and st = 1, together with the fitted models f̂O(ot|0) and f̂O(ot|1)] I We fit a Gaussian mixture distribution f̂O as an estimate of fO in the target infrastructure I For each state s, we obtain the conditional distribution f̂O|s through expectation-maximization
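A minimal sketch of this identification step, assuming the priority-weighted alert counts have already been exported from the emulation into per-state arrays; the variable names and the number of mixture components are illustrative, not taken from the paper.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_observation_model(alerts_no_intrusion, alerts_intrusion, k=3):
        """Fit Gaussian mixtures by EM; return density estimates f_O(o|s) for s in {0, 1}.

        alerts_no_intrusion, alerts_intrusion: 1-D arrays of alert counts
        collected in the emulation for s = 0 and s = 1 (assumed inputs).
        """
        f0 = GaussianMixture(n_components=k).fit(alerts_no_intrusion.reshape(-1, 1))
        f1 = GaussianMixture(n_components=k).fit(alerts_intrusion.reshape(-1, 1))
        return {
            0: lambda o: np.exp(f0.score_samples(np.array([[o]])))[0],
            1: lambda o: np.exp(f1.score_samples(np.array([[o]])))[0],
        }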
  • 66. 19/27 Our Reinforcement Learning Approach I We learn a Nash equilibrium (π∗1,l,θ(1), π∗2,l,θ(2)) through fictitious self-play. I In each iteration: 1. Learn a best response strategy of the defender by solving a POMDP: π̃1,l,θ(1) ∈ B1(π2,l,θ(2)). 2. Learn a best response strategy of the attacker by solving an MDP: π̃2,l,θ(2) ∈ B2(π1,l,θ(1)). 3. Store the best response strategies in two buffers Θ1, Θ2. 4. Update the strategies to be the average of the stored strategies. [Figure: the self-play process iterates from initial strategies (π1, π2) through intermediate strategies (π′1, π′2) towards the equilibrium strategies (π∗1, π∗2)] (Pseudo-code is available in the paper)
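The fictitious self-play loop can be sketched as follows. The best-response subroutines are placeholders for the threshold-learning algorithm described next, and realizing the average strategy by sampling a past best response uniformly is one simple interpretation of step 4 assumed by this sketch (in practice a stored strategy would be sampled per episode rather than per decision).

    import random

    def fictitious_self_play(best_response_defender, best_response_attacker, n_iterations=100):
        buf_defender, buf_attacker = [], []          # buffers of past best responses
        # arbitrary initial average strategies (always continue)
        avg_defender = lambda belief, l: "C"
        avg_attacker = lambda state, belief, l: "C"
        for _ in range(n_iterations):
            # steps 1-2: best responses against the opponent's current average strategy
            buf_defender.append(best_response_defender(avg_attacker))
            buf_attacker.append(best_response_attacker(avg_defender))
            # steps 3-4: the average strategy mixes uniformly over the stored best responses
            avg_defender = lambda belief, l, b=tuple(buf_defender): random.choice(b)(belief, l)
            avg_attacker = lambda state, belief, l, b=tuple(buf_attacker): random.choice(b)(state, belief, l)
        return avg_defender, avg_attacker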
  • 67–69. 20/27 Our Reinforcement Learning Algorithm for Learning Best-Response Threshold Strategies I We use the structural result that threshold best response strategies exist (Theorem 1) to design an efficient reinforcement learning algorithm for learning best responses. I We seek to learn: the L thresholds of the defender, α̃1 ≥ α̃2 ≥ . . . ≥ α̃L ∈ [0, 1], and the 2L thresholds of the attacker, β̃0,1, β̃1,1, . . . , β̃0,L, β̃1,L ∈ [0, 1] I We learn these thresholds iteratively through Robbins and Monro's stochastic approximation algorithm. [Figure: loop between Monte-Carlo simulation, which produces the performance estimate Ĵ(θn), and stochastic approximation (RL), which updates θn to θn+1; the iterates start at θ1 and approach θ̂∗] (Herbert Robbins and Sutton Monro. "A Stochastic Approximation Method". In: The Annals of Mathematical Statistics 22.3 (1951), pp. 400–407. doi: 10.1214/aoms/1177729586.)
  • 70–74. 21/27 Our Reinforcement Learning Algorithm: T-FP 1. We learn the thresholds through simulation. 2. For each iteration n ∈ {1, 2, . . .}, we perturb θ(i)n to obtain θ(i)n + cn∆n and θ(i)n − cn∆n. 3. Then, we run two MDP or POMDP episodes. 4. We use the obtained episode outcomes Ĵi(θ(i)n + cn∆n) and Ĵi(θ(i)n − cn∆n) to estimate ∇θ(i) Ji(θ(i)) with the Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimator: [∇̂θ(i)n Ji(θ(i)n)]k = (Ĵi(θ(i)n + cn∆n) − Ĵi(θ(i)n − cn∆n)) / (2cn(∆n)k). 5. Next, we use the estimated gradient and update the vector of thresholds through the stochastic approximation update: θ(i)n+1 = θ(i)n + an ∇̂θ(i)n Ji(θ(i)n). (James C. Spall. "Multivariate Stochastic Approximation Using a Simultaneous Perturbation Gradient Approximation". In: IEEE Transactions on Automatic Control 37.3 (1992), pp. 332–341.)
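A minimal SPSA-based threshold-learning sketch follows, under the assumptions that thresholds are parameterized through a sigmoid (so the parameter vector can be unconstrained), that evaluate(thresholds) returns an episode-return estimate Ĵi from the simulator, and that the step-size constants are common SPSA defaults rather than the paper's exact hyperparameters.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))            # maps unconstrained parameters to thresholds in [0, 1]

    def spsa_learn_thresholds(evaluate, dim, n_iterations=500,
                              a=1.0, c=1.0, A=50, alpha=0.602, gamma=0.101, seed=0):
        """evaluate(thresholds) -> estimated episode return (assumed simulator hook)."""
        rng = np.random.default_rng(seed)
        theta = np.zeros(dim)
        for n in range(1, n_iterations + 1):
            a_n = a / (n + A) ** alpha              # decreasing step size a_n
            c_n = c / n ** gamma                    # decreasing perturbation size c_n
            delta = rng.choice([-1.0, 1.0], size=dim)        # Rademacher perturbation Delta_n
            j_plus = evaluate(sigmoid(theta + c_n * delta))
            j_minus = evaluate(sigmoid(theta - c_n * delta))
            grad = (j_plus - j_minus) / (2.0 * c_n * delta)  # SPSA gradient estimate
            theta = theta + a_n * grad              # stochastic approximation (ascent) update
        return sigmoid(theta)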
  • 75–76. 22/27 Outline I Use Case & Approach: I Use case: Intrusion prevention I Approach: Emulation, simulation, and reinforcement learning I Game-Theoretic Model of The Use Case I Intrusion prevention as an optimal stopping problem I Partially observed stochastic game I Game Analysis and Structure of (π̃1, π̃2) I Existence of Nash Equilibria I Structural result: multi-threshold best responses I Our Method for Learning Equilibrium Strategies I Our method for emulating the target infrastructure I Our system identification algorithm I Our reinforcement learning algorithm: T-FP I Results & Conclusion I Numerical evaluation results, conclusion, and future work
  • 77. 23/27 Evaluation Results: Learning Nash Equilibrium Strategies [Figure: learning curves over 100 training iterations showing exploitability, defender reward per episode, and intrusion length for (π1,l, π2,l) in emulation and simulation, the Snort IPS baseline, the ot ≥ 1 baseline, and an upper bound] Learning curves from the self-play process with T-FP; the red curves show simulation results and the blue curves show emulation results; the purple, orange, and black curves relate to baseline strategies; the curves indicate the mean and the 95% confidence interval over four training runs with different random seeds.
  • 78. 24/27 Evaluation Results: Convergence Rates [Figure: exploitability and HSVI approximation error (gap) as a function of running time (min) for T-FP, NFSP, and HSVI] Comparison between T-FP and two baseline algorithms, NFSP and HSVI; the red curve relates to T-FP, the blue curve to NFSP, and the purple curve to HSVI; the left plot shows the approximate exploitability metric and the right plot shows the HSVI approximation error. (Karel Horák, Branislav Bošanský, and Michal Pěchouček. "Heuristic Search Value Iteration for One-Sided Partially Observable Stochastic Games". In: Proceedings of the AAAI Conference on Artificial Intelligence (2017). url: https://ojs.aaai.org/index.php/AAAI/article/view/10597.)
  • 79. 25/27 Evaluation Results: Inspection of Learned Strategies [Figure: π2,l(S|0, π1(S|b(1))), π2,l(S|1, π1(S|b(1))), and π1,l(S|b(1)) plotted against b(1) ∈ B for l = 7, l = 4, and l = 1] Probability of the stop action S by the learned equilibrium strategies as a function of b(1) and l; the left and middle plots show the attacker's stopping probability when s = 0 and s = 1, respectively; the right plot shows the defender's stopping probability.
  • 80. 26/27 Evaluation Results: Inspection of Learned Game Values [Figure: V̂∗7(b(1)) and V̂∗1(b(1)) plotted against b(1) ∈ [0, 1]] The value function V̂∗l(b(1)) computed through the HSVI algorithm with approximation error 4; the blue and red curves relate to l = 7 and l = 1, respectively.
  • 81. 27/27 Conclusions & Future Work I Conclusions: I We develop a method to automatically learn security strategies: (1) emulation system; (2) system identification; (3) simulation system; and (4) reinforcement learning. I We apply the method to an intrusion prevention use case I We formulate intrusion prevention as an optimal stopping game I We present a partially observed stochastic game model of the use case I We present a POMDP model of the defender's problem I We present an MDP model of the attacker's problem I We apply optimal stopping theory to establish structural results for the best response strategies and the existence of Nash equilibria I We show numerical results in a realistic emulation environment I We show that our method outperforms two state-of-the-art methods I Our research plans: I Extend the model (remove limiting assumptions) I Impose fewer restrictions on the defender