“Policy-Based Reinforcement Learning for Time Series Anomaly Detection”
by Mengran Yu, Shiliang Sun
Presented by Kishor Datta Gupta
Problem
• Anomaly detection is the identification of rare items, events, or patterns that significantly differ from the majority of the data.
Problem..
• A time series is a sequence of numerical data points in successive order.
Problem..
• What is normal?
• How to measure deviation?
• What counts as a high deviation?
• The task can be framed as a Markov Decision Process because the decision of normal or abnormal at the current time step changes the environment, which in turn affects the next decision.
Applications: intrusion detection, credit card fraud, and medical diagnosis
How we detect anomalies
Statistical methods
Deviations from association rules and frequent item sets
One-class Support Vector Machine
Clustering-based techniques (k-means)
Density-based techniques (k-nearest neighbor, local outlier factor)
Autoencoders and replicators (neural networks)
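For reference, a minimal sketch of one of the classical baselines listed above, a one-class SVM over sliding windows of a univariate series; the window length, the injected anomaly, and the scikit-learn hyperparameters are illustrative assumptions, not values from the paper.

```python
# Sketch: one-class SVM over sliding windows of a univariate series.
# Window length, injected anomaly, and SVM hyperparameters are illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

def sliding_windows(series, m):
    """Stack overlapping windows of length m as rows of a matrix."""
    return np.array([series[i:i + m] for i in range(len(series) - m + 1)])

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.1 * rng.standard_normal(1000)
series[700:705] += 3.0                     # inject a short anomalous burst

X = sliding_windows(series, m=25)
detector = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(X)
labels = detector.predict(X)               # +1 = normal window, -1 = anomalous window
print("flagged windows:", np.where(labels == -1)[0])
```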
Policy-based time series anomaly detector (PTAD)
Based on the asynchronous advantage actor-critic (A3C) algorithm.
The optimal stochastic policy acquired from PTAD can adapt to and distinguish between normal and abnormal behaviors, whether the source and target datasets are the same or different.
The behavior policy is ε-greedy.
PTAD formulation
State:
• The state includes two parts:
• Sequence of previous actions: Sa = (a_{t-m+1}, a_{t-m+2}, ..., a_t)
• Current time series window: St = (s_{t-m+1}, s_{t-m+2}, ..., s_t)
• The state space S is infinite, since real time series exhibit a wide variety of alterations.
Action:
• A = {0, 1}
• 0: normal behavior
• 1: anomalous behavior
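A minimal sketch of how such a state could be assembled, assuming the state is simply the last m observations concatenated with the last m actions (the paper's exact encoding may differ):

```python
# Sketch: build a PTAD-style state from the last m observations and the last m
# previous actions. The concatenation layout is an assumption.
from collections import deque
import numpy as np

class StateBuilder:
    def __init__(self, m: int):
        self.m = m
        self.obs = deque([0.0] * m, maxlen=m)   # s_{t-m+1}, ..., s_t
        self.acts = deque([0] * m, maxlen=m)    # a_{t-m+1}, ..., a_t

    def push(self, observation: float, last_action: int) -> np.ndarray:
        self.obs.append(observation)
        self.acts.append(last_action)
        return np.concatenate([np.array(self.obs), np.array(self.acts, dtype=float)])

builder = StateBuilder(m=5)
state = builder.push(observation=0.42, last_action=0)
print(state.shape)   # (10,): m observations followed by m previous actions
```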
PTAD formulation.
Reward:
• R(s,a) = A if action is TP
• R(s,a) = B if action is FP
• R(s,a) = C if action is FN
• R(s,a) = D if action is TN.
• The reward is based on a confusion matrix in which positive means anomaly and negative means normal behavior. The values of A-D can be altered according to demand.
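A small sketch of this confusion-matrix reward; A, B, C, and D are left as parameters, and the example call uses the values reported later on the experiment slide (A = C = 5, B = D = 1).

```python
# Sketch: confusion-matrix reward for PTAD. Positive = anomaly, negative = normal.
# A, B, C, D are the tunable constants from this slide; how penalties are encoded
# (e.g. negative values for FP/FN) is left to whoever sets them.
def reward(action: int, label: int, A: float, B: float, C: float, D: float) -> float:
    if action == 1 and label == 1:
        return A   # true positive: flagged a real anomaly
    if action == 1 and label == 0:
        return B   # false positive: false alarm
    if action == 0 and label == 1:
        return C   # false negative: missed anomaly
    return D       # true negative: normal point labeled as normal

# Example with the constants reported on the experiment slide (A = C = 5, B = D = 1):
print(reward(action=1, label=1, A=5, B=1, C=5, D=1))   # -> 5
```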
PTAD formulation..
Policy:
• Deterministic policy:
• A value-based detector that directly gives the action to be taken under the current state.
• Stochastic policy:
• Gives the probability of each action under the present state; the criterion for determining an action can be changed.
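A sketch of the distinction, assuming the stochastic policy is a softmax over two action logits produced by some policy network (the network itself is not shown):

```python
# Sketch: deterministic vs. stochastic policy over the actions {0: normal, 1: anomaly}.
# Assumes the policy network outputs two action logits; the network is not shown.
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def deterministic_action(action_values):
    """Value-based / deterministic: always take the highest-valued action."""
    return int(np.argmax(action_values))

def stochastic_action(logits, threshold=0.5, rng=None):
    """Stochastic: act on the probability of 'anomaly'. The criterion (sampling
    or an adjustable threshold) can be changed to trade precision against recall."""
    p_anomaly = softmax(logits)[1]
    if rng is not None:                        # sample from the distribution
        return int(rng.random() < p_anomaly)
    return int(p_anomaly >= threshold)         # or apply a fixed threshold

logits = np.array([0.2, 1.1])
print(deterministic_action(logits), stochastic_action(logits, threshold=0.7))
```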
PTAD formulation…
• Environment:
• A time series repository containing a large population of labeled time series data. The environment can generate specific states for training the agent and evaluate the quality of its actions.
The agent simulates how the anomaly detector will operate and optimizes it.
• Input: the current time stamps and the previous decision
• Output: the new decision for the next time stamp.
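Putting the pieces together, a minimal gym-style sketch of such an environment over a single labeled series; the windowing scheme and reward constants are assumptions, and the paper's actual environment samples states from a repository of many labeled series.

```python
# Sketch: gym-style environment over one labeled series. Windowing and reward
# constants are assumptions; the paper's environment samples from a repository
# of many labeled time series.
import numpy as np

class TimeSeriesEnv:
    def __init__(self, series, labels, m=5, A=5, B=1, C=5, D=1):
        self.series, self.labels, self.m = np.asarray(series), np.asarray(labels), m
        self.rewards = {(1, 1): A, (1, 0): B, (0, 1): C, (0, 0): D}

    def reset(self):
        self.t = self.m                          # index just past the first window
        self.prev_actions = [0] * self.m
        return self._state()

    def _state(self):
        window = self.series[self.t - self.m:self.t]
        return np.concatenate([window, np.array(self.prev_actions, dtype=float)])

    def step(self, action: int):
        r = self.rewards[(action, int(self.labels[self.t - 1]))]
        self.prev_actions = self.prev_actions[1:] + [action]
        self.t += 1
        done = self.t > len(self.series)
        return (None if done else self._state()), r, done

env = TimeSeriesEnv(series=np.sin(np.arange(50) / 3.0), labels=np.zeros(50, dtype=int))
state = env.reset()
state, r, done = env.step(0)                     # decision for the current time stamp
```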
Asynchronous advantage actor-critic algorithm (A3C)
• Uses multiple agents rather than one single agent, each with its own network parameters and its own copy of the environment.
• Each agent is coordinated by a global network and contributes to the overall learning.
• The actor-critic approach combines value-based and policy-gradient methods: it predicts both the value function and the optimal policy function.
• The learning agent uses the value function (critic) to update the optimal policy function (actor).
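A minimal sketch of an actor-critic network with a shared trunk, a policy head over the two actions, and a value head, written in PyTorch; the architecture and layer sizes are assumptions, not the paper's network.

```python
# Sketch: actor-critic network with a shared trunk, a policy head (2 actions),
# and a value head. Layer sizes and the feed-forward trunk are assumptions.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64, n_actions: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # actor: action logits
        self.value_head = nn.Linear(hidden, 1)            # critic: state value

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        return self.policy_head(h), self.value_head(h)

net = ActorCritic(state_dim=10)
logits, value = net(torch.zeros(1, 10))
probs = torch.softmax(logits, dim=-1)   # stochastic policy over {normal, anomaly}
```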
PTAD construction:
• The A3C algorithm is used to construct PTAD, which decreases correlations between successive examples.
• There are n independent environments, each containing the whole labeled time series data in a different order. Each environment provides time stamps of distinct time series as states and updates itself after an action has been taken.
• PTAD has one global network and n local networks, one local network per environment. All local networks use the actor-critic framework.
• Every agent starts from a different initial environment so that agents learn from different situations, which improves anomaly detection performance and avoids overfitting to abnormal patterns. The global network accumulates the gradients from the workers and optimizes the policy.
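A structural sketch of this worker/global update cycle, with stand-in gradients in place of real actor-critic gradients; it only illustrates how n asynchronous workers push accumulated gradients to the global network every few steps and then sync back its weights.

```python
# Structural sketch of the worker/global cycle: each worker interacts with its own
# environment, accumulates gradients, pushes them to the global network every few
# steps, and syncs back the global weights. The "gradients" here are random
# stand-ins; real actor-critic gradient computation is omitted.
import threading
import numpy as np

class GlobalNetwork:
    def __init__(self, dim, lr=0.001):
        self.weights = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def apply_gradients(self, grads):
        with self.lock:                       # workers update asynchronously
            self.weights -= self.lr * grads
            return self.weights.copy()        # worker syncs to the latest weights

def worker(global_net, n_steps=20, update_every=5, seed=0):
    rng = np.random.default_rng(seed)
    local_weights = global_net.weights.copy()
    grad_buffer = np.zeros_like(local_weights)
    for step in range(1, n_steps + 1):
        # stand-in for: act in this worker's own environment and accumulate
        # actor-critic gradients from the collected transitions
        grad_buffer += rng.standard_normal(local_weights.shape)
        if step % update_every == 0:          # push every 5 steps, as in the paper
            local_weights = global_net.apply_gradients(grad_buffer)
            grad_buffer[:] = 0.0

global_net = GlobalNetwork(dim=8)
threads = [threading.Thread(target=worker, args=(global_net,), kwargs={"seed": i})
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```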
PTAD Components:
Experiment: PTAD is trained on a multi-core CPU with 8 threads, without a GPU. The local networks deliver gradients to the global network every 5 steps, and the learning rates of the actor and critic networks are 0.001 and 0.0001, respectively. The total number of training episodes is 20,000. The parameters of the reward function are set to A = C = 5, B = D = 1.
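For reference, those hyperparameters gathered in one place; the dataclass is only an organizational convenience, while the values are the ones stated on this slide.

```python
# The experiment's hyperparameters in one place. The dataclass is only an
# organizational convenience; the values are those stated on the slide.
from dataclasses import dataclass

@dataclass
class PTADConfig:
    n_threads: int = 8          # CPU workers, no GPU
    update_every: int = 5       # local -> global gradient push interval (steps)
    actor_lr: float = 0.001
    critic_lr: float = 0.0001
    episodes: int = 20_000
    reward_tp: float = 5        # A
    reward_fp: float = 1        # B
    reward_fn: float = 5        # C
    reward_tn: float = 1        # D

print(PTADConfig())
```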
Experimental results
Advantage
PTAD achieves the best performance not only when the source and target datasets are the same but also when they differ.
It has a stochastic policy, which slightly improves detection performance and can explore the tradeoff between precision and recall to meet practical requirements.
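A small illustration of that precision/recall tradeoff: sweeping the decision threshold applied to the policy's anomaly probabilities and reporting precision and recall at each setting (the probabilities and labels below are synthetic, for illustration only).

```python
# Sweep the anomaly-probability threshold and report precision/recall at each
# setting. The probabilities and labels are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
labels = (rng.random(200) < 0.1).astype(int)                         # ~10% anomalies
probs = np.clip(0.25 + 0.4 * labels + 0.35 * rng.standard_normal(200), 0, 1)

for thr in (0.3, 0.5, 0.7):
    pred = (probs >= thr).astype(int)
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={thr:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```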
My thoughts
Using the confusion matrix to compute the RL reward function is interesting.
They did not compare their results with autoencoder-based techniques.
Questions?
