The document describes modeling cyber security as a Markov game and using reinforcement learning to learn security strategies. Specifically:
- A network infrastructure is modeled as a graph. States represent the network configuration, and actions include patching or attacking components.
- The interaction is formulated as a Markov game with two players, an attacker and a defender. Reinforcement learning approximates optimal policies by representing each policy with a neural network and optimizing a policy gradient objective.
- Challenges include large state and action spaces and a non-stationary environment, since each player's opponent changes as it learns. The method addresses these through policy parameterization, auto-regressive policy representations, and training against an opponent pool.
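
To make the auto-regressive policy representation concrete, the following is a minimal sketch, not the document's implementation. It assumes an action is a pair (node, action type) and factorizes the policy as pi(node, type | state) = pi(node | state) * pi(type | state, node). The class name `AutoRegressivePolicy`, the linear scoring layers (standing in for a neural network), and all dimensions are hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, features):
    """Apply a weight matrix (one row per output) to a feature vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


class AutoRegressivePolicy:
    """Hypothetical two-step policy: choose a node, then an action type.

    Assumption: linear layers with random weights replace the neural
    network mentioned in the text, purely to keep the sketch short.
    """

    def __init__(self, state_dim, num_nodes, num_action_types, seed=0):
        self.rng = random.Random(seed)
        self.num_nodes = num_nodes
        self.num_action_types = num_action_types
        # Head 1 scores nodes from the state alone.
        self.w_node = [[self.rng.gauss(0, 0.1) for _ in range(state_dim)]
                       for _ in range(num_nodes)]
        # Head 2 scores action types from the state plus a one-hot node.
        self.w_type = [[self.rng.gauss(0, 0.1)
                        for _ in range(state_dim + num_nodes)]
                       for _ in range(num_action_types)]

    def node_probs(self, state):
        return softmax(linear(self.w_node, state))

    def type_probs(self, state, node):
        one_hot = [1.0 if i == node else 0.0 for i in range(self.num_nodes)]
        return softmax(linear(self.w_type, list(state) + one_hot))

    def sample(self, state):
        """Sample (node, action_type) and return the joint log-probability."""
        p_node = self.node_probs(state)
        node = self.rng.choices(range(self.num_nodes), weights=p_node)[0]
        p_type = self.type_probs(state, node)
        action_type = self.rng.choices(
            range(self.num_action_types), weights=p_type)[0]
        # The chain rule gives the joint log-probability as a sum.
        log_prob = math.log(p_node[node]) + math.log(p_type[action_type])
        return node, action_type, log_prob


# Example usage with made-up dimensions.
policy = AutoRegressivePolicy(state_dim=4, num_nodes=5, num_action_types=3)
node, action_type, log_prob = policy.sample([1.0, 0.0, 0.5, -1.0])
```

In this sketch the policy emits 5 + 3 = 8 scores per decision instead of one score for each of the 5 * 3 = 15 joint actions, which is the kind of saving that matters as the number of nodes and action types grows. The returned `log_prob` is the quantity a policy gradient update would weight by the observed return.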