This document discusses a policy-based reinforcement learning approach, PTAD, for time series anomaly detection. PTAD formulates anomaly detection as a Markov Decision Process (MDP) and uses an asynchronous actor-critic algorithm to learn a stochastic policy. At each time step, the agent observes the current and previous time series values together with its previous actions, and outputs a decision of normal or anomalous. The reward is computed from the resulting confusion-matrix outcome (true/false positives and negatives). Experimental results show that PTAD achieves the best performance both within and across datasets by adapting to their different behaviors, and the stochastic policy makes it possible to explore precision-recall trade-offs. While interesting, the approach is not compared against neural-network-based techniques such as autoencoders.
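
Since the summary describes the reward as a confusion-matrix calculation and the state as current and previous observations plus previous actions, a minimal Python sketch of those two pieces might look like the following. The window size, the specific reward values, and the helper names (`confusion_reward`, `make_state`) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

# Assumed reward values keyed to confusion-matrix outcomes; the paper's
# actual magnitudes are not given in the summary.
REWARDS = {
    "TP": 5.0,   # anomaly correctly flagged
    "TN": 1.0,   # normal point correctly passed
    "FP": -1.0,  # false alarm
    "FN": -5.0,  # missed anomaly
}

def confusion_reward(action: int, label: int) -> float:
    """Reward for one step. action/label: 1 = anomalous, 0 = normal."""
    if action == 1 and label == 1:
        return REWARDS["TP"]
    if action == 0 and label == 0:
        return REWARDS["TN"]
    if action == 1 and label == 0:
        return REWARDS["FP"]
    return REWARDS["FN"]

def make_state(series: np.ndarray, actions: list, t: int, window: int = 5) -> np.ndarray:
    """Concatenate recent observations with recent actions, mirroring the
    summary's description of the state (current and previous data plus
    previous actions); left-pads with zeros at the start of an episode."""
    obs = series[max(0, t - window + 1): t + 1]
    obs = np.pad(obs, (window - len(obs), 0))
    acts = np.array(actions[-window:], dtype=float)
    acts = np.pad(acts, (window - len(acts), 0))
    return np.concatenate([obs, acts])
```

Under this sketch, a learned stochastic policy would map `make_state(...)` to a probability of flagging the point as anomalous; shifting the relative penalties for FP versus FN is one way the precision-recall trade-off mentioned above could be explored.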