The document describes using reinforcement learning to model cyber security as a Markov game and learn security strategies. Specifically:
- A network infrastructure is modeled as a graph, with states representing network configuration and actions like patching or attacking components.
- This is formulated as a Markov game with two players (attacker and defender). Reinforcement learning is used to approximate optimal policies by representing policies with neural networks and optimizing a policy gradient objective.
- Challenges include large state/action spaces and a non-stationary environment. The method addresses these through policy parameterization, auto-regressive policy representations, and training against an opponent pool.