“Policy-Based Reinforcement Learning for Time Series Anomaly Detection”
by Mengran Yu, Shiliang Sun
Presented by Kishor Datta Gupta
Problem
• Anomaly detection is the identification of rare items, events, or patterns that significantly differ from the majority of the data.
Problem..
• A time series is a sequence of numerical data points in successive order.
Problem..
• What is normal?
• How to measure deviation?
• What counts as a high deviation?
• The task can be framed as a Markov Decision Process because the decision of normal or abnormal at the current time step changes the environment, which in turn affects the next decision.
Applications: intrusion detection, credit card fraud, and medical diagnosis
How we detect anomalies
Statistical methods
Deviations from association rules and frequent item sets
One-class Support Vector Machine
Clustering-based techniques (k-means)
Density-based techniques (k-nearest neighbor, local outlier factor)
Autoencoders and replicators (neural networks)
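For reference, a minimal sketch of one of the classical baselines listed above, a one-class SVM over sliding windows of a univariate series; the window length, the injected anomaly, and the scikit-learn hyperparameters are illustrative assumptions, not values from the paper.

```python
# Sketch: one-class SVM over sliding windows of a univariate series.
# Window length, injected anomaly, and SVM hyperparameters are illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

def sliding_windows(series, m):
    """Stack overlapping windows of length m as rows of a matrix."""
    return np.array([series[i:i + m] for i in range(len(series) - m + 1)])

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1000)) + 0.1 * rng.standard_normal(1000)
series[700:705] += 3.0                     # inject a short anomalous burst

X = sliding_windows(series, m=25)
detector = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(X)
labels = detector.predict(X)               # +1 = normal window, -1 = anomalous window
print("flagged windows:", np.where(labels == -1)[0])
```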
Policy-based time series anomaly detector (PTAD)
Based on the asynchronous advantage actor-critic (A3C) algorithm.
The optimal stochastic policy acquired from PTAD can adapt to and distinguish between normal and abnormal behaviors, whether the source and target datasets are the same or different.
The behavior policy is ε-greedy.
PTAD formulation
State:
• The state includes two parts:
• Sequence of previous actions: Sa = (a_{t-m+1}, a_{t-m+2}, ..., a_t)
• Current time series window: St = (s_{t-m+1}, s_{t-m+2}, ..., s_t)
• The state space S is infinite, since real time series exhibit a wide variety of alterations.
Action:
• A = {0, 1}
• 0: normal behavior
• 1: anomalous behavior
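A minimal sketch of how such a state could be assembled, assuming the state is simply the last m observations concatenated with the last m actions (the paper's exact encoding may differ):

```python
# Sketch: build a PTAD-style state from the last m observations and the last m
# previous actions. The concatenation layout is an assumption.
from collections import deque
import numpy as np

class StateBuilder:
    def __init__(self, m: int):
        self.m = m
        self.obs = deque([0.0] * m, maxlen=m)   # s_{t-m+1}, ..., s_t
        self.acts = deque([0] * m, maxlen=m)    # a_{t-m+1}, ..., a_t

    def push(self, observation: float, last_action: int) -> np.ndarray:
        self.obs.append(observation)
        self.acts.append(last_action)
        return np.concatenate([np.array(self.obs), np.array(self.acts, dtype=float)])

builder = StateBuilder(m=5)
state = builder.push(observation=0.42, last_action=0)
print(state.shape)   # (10,): m observations followed by m previous actions
```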
PTAD formulation.
Reward:
• R(s,a) = A if action is TP
• R(s,a) = B if action is FP
• R(s,a) = C if action is FN
• R(s,a) = D if action is TN.
• The reward is based on a confusion matrix in which positive means anomaly and negative means normal behavior. The values of A-D can be altered according to demand.
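A small sketch of this confusion-matrix reward; A, B, C, and D are left as parameters, and the example call uses the values reported later on the experiment slide (A = C = 5, B = D = 1).

```python
# Sketch: confusion-matrix reward for PTAD. Positive = anomaly, negative = normal.
# A, B, C, D are the tunable constants from this slide; how penalties are encoded
# (e.g. negative values for FP/FN) is left to whoever sets them.
def reward(action: int, label: int, A: float, B: float, C: float, D: float) -> float:
    if action == 1 and label == 1:
        return A   # true positive: flagged a real anomaly
    if action == 1 and label == 0:
        return B   # false positive: false alarm
    if action == 0 and label == 1:
        return C   # false negative: missed anomaly
    return D       # true negative: normal point labeled as normal

# Example with the constants reported on the experiment slide (A = C = 5, B = D = 1):
print(reward(action=1, label=1, A=5, B=1, C=5, D=1))   # -> 5
```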
PTAD formulation..
Policy:
• Deterministic policy:
• A value-based detector that directly gives the action to be taken under the current state.
• Stochastic policy:
• Gives the probability of each action under the present state; the criterion for determining an action can be changed.
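A sketch of the distinction, assuming the stochastic policy is a softmax over two action logits produced by some policy network (the network itself is not shown):

```python
# Sketch: deterministic vs. stochastic policy over the actions {0: normal, 1: anomaly}.
# Assumes the policy network outputs two action logits; the network is not shown.
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def deterministic_action(action_values):
    """Value-based / deterministic: always take the highest-valued action."""
    return int(np.argmax(action_values))

def stochastic_action(logits, threshold=0.5, rng=None):
    """Stochastic: act on the probability of 'anomaly'. The criterion (sampling
    or an adjustable threshold) can be changed to trade precision against recall."""
    p_anomaly = softmax(logits)[1]
    if rng is not None:                        # sample from the distribution
        return int(rng.random() < p_anomaly)
    return int(p_anomaly >= threshold)         # or apply a fixed threshold

logits = np.array([0.2, 1.1])
print(deterministic_action(logits), stochastic_action(logits, threshold=0.7))
```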
PTAD formulation…
• Environment:
• A time series repository containing a large population of labeled time series data. The environment can generate specific states for training the agent and evaluate the quality of its actions.
The agent simulates how the anomaly detector will operate and optimizes it.
• Input: the current time stamps and the previous decision
• Output: the new decision for the next time stamp.
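Putting the pieces together, a minimal gym-style sketch of such an environment over a single labeled series; the windowing scheme and reward constants are assumptions, and the paper's actual environment samples states from a repository of many labeled series.

```python
# Sketch: gym-style environment over one labeled series. Windowing and reward
# constants are assumptions; the paper's environment samples from a repository
# of many labeled time series.
import numpy as np

class TimeSeriesEnv:
    def __init__(self, series, labels, m=5, A=5, B=1, C=5, D=1):
        self.series, self.labels, self.m = np.asarray(series), np.asarray(labels), m
        self.rewards = {(1, 1): A, (1, 0): B, (0, 1): C, (0, 0): D}

    def reset(self):
        self.t = self.m                          # index just past the first window
        self.prev_actions = [0] * self.m
        return self._state()

    def _state(self):
        window = self.series[self.t - self.m:self.t]
        return np.concatenate([window, np.array(self.prev_actions, dtype=float)])

    def step(self, action: int):
        r = self.rewards[(action, int(self.labels[self.t - 1]))]
        self.prev_actions = self.prev_actions[1:] + [action]
        self.t += 1
        done = self.t > len(self.series)
        return (None if done else self._state()), r, done

env = TimeSeriesEnv(series=np.sin(np.arange(50) / 3.0), labels=np.zeros(50, dtype=int))
state = env.reset()
state, r, done = env.step(0)                     # decision for the current time stamp
```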
Asynchronous advantage actor-critic algorithm (A3C)
• Uses multiple agents rather than one single agent, each with its own network parameters and its own copy of the environment.
• Each agent is coordinated by a global network and contributes to the overall learning.
• The actor-critic approach combines value-based and policy-gradient methods: it predicts both the value function and the optimal policy function.
• The learning agent uses the value function (critic) to update the optimal policy function (actor).
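A minimal sketch of an actor-critic network with a shared trunk, a policy head over the two actions, and a value head, written in PyTorch; the architecture and layer sizes are assumptions, not the paper's network.

```python
# Sketch: actor-critic network with a shared trunk, a policy head (2 actions),
# and a value head. Layer sizes and the feed-forward trunk are assumptions.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64, n_actions: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # actor: action logits
        self.value_head = nn.Linear(hidden, 1)            # critic: state value

    def forward(self, state: torch.Tensor):
        h = self.trunk(state)
        return self.policy_head(h), self.value_head(h)

net = ActorCritic(state_dim=10)
logits, value = net(torch.zeros(1, 10))
probs = torch.softmax(logits, dim=-1)   # stochastic policy over {normal, anomaly}
```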
PTAD construction:
• The A3C algorithm is used to construct PTAD, which decreases correlations between successive examples.
• There are n independent environments, each containing the whole labeled time series data in a different order. Each environment provides time stamps of distinct time series as states and updates itself after an action has been taken.
• PTAD has one global network and n local networks, one local network per environment. All local networks use the actor-critic framework.
• Every agent starts from a different initial environment so that agents learn from different situations, which improves anomaly detection performance and avoids overfitting to abnormal patterns. The global network accumulates the gradients from the workers and optimizes the policy.
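A structural sketch of this worker/global update cycle, with stand-in gradients in place of real actor-critic gradients; it only illustrates how n asynchronous workers push accumulated gradients to the global network every few steps and then sync back its weights.

```python
# Structural sketch of the worker/global cycle: each worker interacts with its own
# environment, accumulates gradients, pushes them to the global network every few
# steps, and syncs back the global weights. The "gradients" here are random
# stand-ins; real actor-critic gradient computation is omitted.
import threading
import numpy as np

class GlobalNetwork:
    def __init__(self, dim, lr=0.001):
        self.weights = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def apply_gradients(self, grads):
        with self.lock:                       # workers update asynchronously
            self.weights -= self.lr * grads
            return self.weights.copy()        # worker syncs to the latest weights

def worker(global_net, n_steps=20, update_every=5, seed=0):
    rng = np.random.default_rng(seed)
    local_weights = global_net.weights.copy()
    grad_buffer = np.zeros_like(local_weights)
    for step in range(1, n_steps + 1):
        # stand-in for: act in this worker's own environment and accumulate
        # actor-critic gradients from the collected transitions
        grad_buffer += rng.standard_normal(local_weights.shape)
        if step % update_every == 0:          # push every 5 steps, as in the paper
            local_weights = global_net.apply_gradients(grad_buffer)
            grad_buffer[:] = 0.0

global_net = GlobalNetwork(dim=8)
threads = [threading.Thread(target=worker, args=(global_net,), kwargs={"seed": i})
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```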
PTAD Components:
Experiment: PTAD is trained on a multi-core CPU with 8 threads, without a GPU. The local networks deliver gradients to the global network every 5 steps, and the learning rates of the actor and critic networks are 0.001 and 0.0001, respectively. The total number of training episodes is 20,000. The parameters of the reward function are set to A = C = 5, B = D = 1.
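For reference, those hyperparameters gathered in one place; the dataclass is only an organizational convenience, while the values are the ones stated on this slide.

```python
# The experiment's hyperparameters in one place. The dataclass is only an
# organizational convenience; the values are those stated on the slide.
from dataclasses import dataclass

@dataclass
class PTADConfig:
    n_threads: int = 8          # CPU workers, no GPU
    update_every: int = 5       # local -> global gradient push interval (steps)
    actor_lr: float = 0.001
    critic_lr: float = 0.0001
    episodes: int = 20_000
    reward_tp: float = 5        # A
    reward_fp: float = 1        # B
    reward_fn: float = 5        # C
    reward_tn: float = 1        # D

print(PTADConfig())
```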
Experimental results
Advantage
PTAD achieves the best performance not only when the source and target datasets are the same but also when they differ.
It has a stochastic policy, which slightly improves detection performance and can explore the tradeoff between precision and recall to meet practical requirements.
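A small illustration of that precision/recall tradeoff: sweeping the decision threshold applied to the policy's anomaly probabilities and reporting precision and recall at each setting (the probabilities and labels below are synthetic, for illustration only).

```python
# Sweep the anomaly-probability threshold and report precision/recall at each
# setting. The probabilities and labels are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
labels = (rng.random(200) < 0.1).astype(int)                         # ~10% anomalies
probs = np.clip(0.25 + 0.4 * labels + 0.35 * rng.standard_normal(200), 0, 1)

for thr in (0.3, 0.5, 0.7):
    pred = (probs >= thr).astype(int)
    tp = int(np.sum((pred == 1) & (labels == 1)))
    fp = int(np.sum((pred == 1) & (labels == 0)))
    fn = int(np.sum((pred == 0) & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={thr:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```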
My thoughts
Using the confusion matrix to compute the RL reward function is interesting.
They did not compare their results with autoencoder-based techniques.
Questions?
