Development of Anomaly Prediction Detection Method Using Bayesian Inverse Reinforcement Learning
-PERC
○Dinesh B. Malla*1,3 (M2), Tomah Sogabe1,2,3, Masaru Sogabe3, Tomoaki Kimura1, Katsuyoshi Sakamoto1,2, Koichi Yamaguchi1,2
*1 Info-Powered Energy System Research Center, *2 Department of Engineering Science, The University of Electro-Communications
*3 Technology Solution Group, Grid Inc.
Outline
Introduction: Reinforcement learning (RL), Inverse reinforcement learning (IRL)
Reward calculation
Max entropy method
Bayes method
Summary
1
Objectives
Bayes neural network for inverse reinforcement learning.
Reinforcement learning for intentional or goal-oriented anomaly detection.
2
MDP formulation: $\langle S, A, P, r, \gamma, P_0 \rangle$

$$\max_{\pi\in\Pi} V(\pi) \;\triangleq\; \max_{\pi\in\Pi} \mathbb{E}_{\pi}\!\left[r(s,a)\right] \;=\; \max_{\pi\in\Pi} \mathbb{E}_{s_0\sim p_0}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t, a_t)\,\middle|\,\pi\right]$$

S: state, A: action, P: transition model, r: reward function, $\gamma$: discount factor, $P_0$: starting-state distribution

Goal: maximize cumulative rewards.
Fully specified MDP: value iteration & policy iteration.
RL for a fully specified MDP
3
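To make the fully specified case concrete, here is a minimal value-iteration sketch for a small tabular MDP; the transition tensor and reward array are made-up placeholders, not values from the presentation.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-6):
    """Value iteration for a fully specified tabular MDP.

    P: transition tensor of shape (S, A, S), P[s, a, s'] = P(s'|s, a)
    r: reward array of shape (S, A)
    Returns the optimal value function V and a greedy policy.
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = r + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
V, pi = value_iteration(P, r)
print(V, pi)
```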
Model-free RL
MDP formulation: $\langle S, A, P, r, \gamma, P_0 \rangle$, where the dynamics model $P$ is unknown.

$$V(\pi_\theta) = \mathbb{E}_{s_0\sim p_0}\!\left[\text{cumulative rewards} \mid \pi_\theta\right]$$

Policy Gradient Theorem:
$$\nabla_\theta V(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$$

Update rule:
$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, Q^{\pi_\theta}(s_t, a_t)$$

Policy gradient methods model and optimize the policy directly; the policy is usually modeled with a function $\pi_\theta(a|s)$ parameterized by $\theta$.
Sutton et al., ICML 1999; http://christinemcleavey.com/world-cup-reinforcement-learning/
4
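A minimal sketch of this update rule, assuming a tabular softmax policy and using the Monte Carlo return as the Q estimate (REINFORCE); the toy episode is illustrative only.

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.99
theta = np.zeros((n_states, n_actions))     # tabular policy parameters

def pi(s):
    """Softmax policy pi_theta(.|s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta (nonzero only in row s)."""
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

# One hypothetical episode of (state, action, reward) tuples.
episode = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]

# REINFORCE: use the Monte Carlo return G_t as the estimate of Q(s_t, a_t).
G = 0.0
for s, a, r in reversed(episode):
    G = r + gamma * G
    theta += alpha * grad_log_pi(s, a) * G
```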
Reward engineering
Model-given (small search-space environment): $\langle S, A, P, r, \gamma, P_0 \rangle$ where the right reward function $r$ is unknown.
Real-world problem (large search space): $\langle S, A, P, r, \gamma, P_0 \rangle$ where both the dynamics $P$ and the right reward function $r$ are unknown.
Reward-engineering loop: Run RL → Compare with expert → Update reward.
5
Inverse RL: learn $r$ such that
$$\pi^* = \arg\max_\theta\; \mathbb{E}_{s\sim P(s|\theta)}\!\left[r\!\left(s, \pi_\theta(s)\right)\right]$$
Assumes learning $r$ is statistically easier than directly learning $\pi^*$.

Direct policy learning (via an interactive demonstrator): Collect demonstrations → Supervised learning → Roll out in the environment. Requires an interactive demonstrator (behavioral cloning is the 1-step special case).

Behavioral cloning: works well when $P^*$ is close to $P_\theta$. Training loss:
$$\arg\min_\theta\; \mathbb{E}_{(s, a^*)\sim P^*}\!\left[L\!\left(a^*, \pi_\theta(s)\right)\right]$$

Reward engineering
"Robots Learning to Move like Animals": https://bair.berkeley.edu/blog/2020/04/03/laikago/
6
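For contrast, a minimal behavioral-cloning sketch that minimizes the imitation loss above with a linear softmax policy; the demonstration arrays and model shape are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical demonstrations: states (N, d) and expert action indices (N,).
states = rng.normal(size=(256, 4))
expert_actions = rng.integers(0, 3, size=256)

# Linear softmax policy pi_theta(a|s); theta has shape (d, n_actions).
theta = np.zeros((4, 3))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for _ in range(200):
    probs = softmax(states @ theta)          # (N, n_actions)
    onehot = np.eye(3)[expert_actions]       # (N, n_actions)
    # Gradient of the cross-entropy imitation loss L(a*, pi_theta(s)).
    grad = states.T @ (probs - onehot) / len(states)
    theta -= lr * grad
```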
Summarization of reward engineering

Method                                  | Direct policy learning | Reward learning | Access to environment | Interactive demonstrator | Pre-collected demonstrations
Behavioral cloning                      | Yes                    | No              | No                    | No                       | Yes
Direct policy learning (interactive IL) | Yes                    | No              | Yes                   | Yes                      | Optional
Inverse reinforcement learning          | No                     | Yes             | Yes                   | No                       | Yes

Learning reductions: what does learning well on B (behavioral cloning) imply about A (imitation learning)?
7
Maximum entropy IRL

Maximum entropy principle (E. T. Jaynes, 1957): the probability distribution that best represents the current state of knowledge is the one with the largest entropy.

A policy induces a distribution over trajectories, but:
*Many reward functions correspond to the same policy.
*Many stochastic mixtures of policies correspond to the same feature expectations.

Linear reward: $r(s_t) = \theta^{T}\phi(s_t)$. Converting the linear constraints to the Lagrangian form (Ziebart et al., AAAI '08) gives the assumed trajectory distribution:
$$P(\tau\mid\theta) = \frac{1}{Z(\theta)}\, e^{\sum_{s_t\in\tau}\theta^{T}\phi(s_t)}$$
8
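A minimal sketch of this trajectory distribution over a small enumerable set of trajectories, assuming a hypothetical one-hot feature map; it only illustrates that higher-reward trajectories become exponentially more likely.

```python
import numpy as np

# Hypothetical feature map phi(s) for 4 states and a linear reward r(s) = theta . phi(s).
phi = np.eye(4)
theta = np.array([0.0, 0.5, 1.0, 2.0])

# A small enumerable set of candidate trajectories (state sequences).
trajectories = [[0, 1, 3], [0, 2, 3], [0, 1, 2, 3]]

# Unnormalized weight of each trajectory: exp(sum_t theta . phi(s_t)).
weights = np.array([np.exp(sum(theta @ phi[s] for s in traj)) for traj in trajectories])
Z = weights.sum()            # partition function over the enumerated set
probs = weights / Z          # P(tau | theta)
print(probs)                 # higher-reward trajectories are exponentially more likely
```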
Maximum entropy IRL learning (Ziebart et al., AAAI '08)

Reward inference: maximize the log likelihood of the demonstrations $D$:
$$\theta^{*} = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_{\tau_i\in D}\log P(\tau_i\mid\theta)$$

Gradient ascent on the log likelihood:
$$\nabla_\theta L(\theta) = \frac{1}{m}\sum_{\tau_i\in D}\mu(\tau_i) \;-\; \sum_{s} d_{s}^{\theta}\,\phi(s), \qquad \mu(\tau_i) = \frac{1}{T}\sum_{s'\in\tau_i}\phi(s')$$

Dynamic programming for the state occupancy measure (visitation frequency):
$$d_{t+1,s'} = \sum_{a}\sum_{s} d_{t,s}\,\pi_\theta(a\mid s)\, P(s'\mid s, a)$$
9
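A minimal sketch of this gradient, assuming a tabular MDP with a known transition tensor, a fixed policy, and a fixed start state; the feature map, dynamics, and trajectories in the usage example are illustrative.

```python
import numpy as np

def maxent_irl_gradient(phi, P, pi, expert_trajs, T):
    """Gradient of the MaxEnt IRL log likelihood for a linear reward r(s) = theta^T phi(s).

    phi: feature matrix (S, k)
    P: transition tensor (S, A, S)
    pi: policy (S, A), pi[s, a] = pi_theta(a|s)
    expert_trajs: list of state sequences
    T: horizon used for the visitation-frequency recursion
    """
    n_states, k = phi.shape

    # Expert feature expectations: mu(tau) = (1/T) * sum of phi over visited states.
    mu = np.zeros(k)
    for traj in expert_trajs:
        mu += phi[traj].sum(axis=0) / len(traj)
    mu /= len(expert_trajs)

    # State visitation frequencies: d_{t+1, s'} = sum_{a, s} d_{t, s} pi(a|s) P(s'|s, a).
    d = np.zeros(n_states)
    d[0] = 1.0                     # assume a fixed start state 0
    d_total = d.copy()
    for _ in range(T - 1):
        d = np.einsum('s,sa,saq->q', d, pi, P)
        d_total += d
    d_total /= T

    # grad L(theta) = mu - sum_s d_s phi(s)
    return mu - d_total @ phi

# Usage with a toy 3-state, 2-action setup (illustrative numbers only):
phi = np.eye(3)
P = np.full((3, 2, 3), 1.0 / 3)
pi = np.full((3, 2), 0.5)
expert_trajs = [[0, 1, 2], [0, 2, 2]]
print(maxent_irl_gradient(phi, P, pi, expert_trajs, T=3))
```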
IRL + model-free optimization

Model-given: $\langle S, A, P, r, \gamma, P_0 \rangle$ with $r$ unknown; small, given model; no interaction with the environment.
Model-free: $\langle S, A, P, r, \gamma, P_0 \rangle$ with both $P$ and $r$ unknown; interact with the environment / simulator; large continuous space.
Loop: the agent exchanges state, action, and reward with the environment; Run RL → Compare with expert → Update reward.
10
Bayesian IRL concept

Recall the MaxEnt formulation with a linear reward $r(s\mid\theta) = \theta^{T}\phi(s)$:
$$P(\tau\mid\theta) = \frac{1}{Z(\theta)}\, e^{\sum_{s_t\in\tau}\theta^{T}\phi(s_t)}, \qquad Z(\theta) = \int e^{r(\tau\mid\theta)}\, d\tau$$
Here the reward $r_\theta(s_t, a_t)$ can instead be parameterized by a neural network (the agent's reward parameters).

Applying Bayes' theorem to the reward function R: "Computing the normalizing constant Z' is hard. However, the sampling algorithms we will use for inference only need the ratios of the densities at two points, so this is not a problem." (Deepak Ramachandran et al., IJCAI '07)
11
Bayesian IRL using a neural network (Bayes neural network)
Google DeepMind 2015, "Weight Uncertainty in Neural Networks"

The neural network is treated as a probabilistic model $P(y\mid x, w)$; each weight is assigned a distribution.
Input: (state, action); output: reward.
The prior distribution over the weights is assumed Gaussian $(\mu, \sigma^2)$.

Training dataset $\mathcal{D} = \{x^{(i)}, y^{(i)}\}$. Construct the likelihood function
$$p(\mathcal{D}\mid w) = \prod_i p\!\left(y^{(i)}\mid x^{(i)}, w\right)$$
Posterior distribution:
$$p(w\mid\mathcal{D}) \propto p(\mathcal{D}\mid w)\, p(w)$$
12
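A minimal sketch of the "each weight is a distribution" idea: every weight has a Gaussian (µ, σ²) and each forward pass samples the weights, so repeated passes give a predictive distribution over the reward. The layer sizes and the softplus reparameterization are assumptions, not the exact network used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

class BayesLinear:
    """Linear layer whose weights are Gaussian: w ~ N(mu, sigma^2)."""
    def __init__(self, n_in, n_out):
        self.mu = rng.normal(scale=0.1, size=(n_in, n_out))
        self.rho = np.full((n_in, n_out), -3.0)     # sigma = softplus(rho)

    def forward(self, x):
        sigma = np.log1p(np.exp(self.rho))
        w = self.mu + sigma * rng.normal(size=self.mu.shape)    # sample weights
        return x @ w

# Reward network: input is a (state, action) feature vector, output is a scalar reward.
l1, l2 = BayesLinear(6, 16), BayesLinear(16, 1)

def sample_reward(state_action):
    h = np.tanh(l1.forward(state_action))
    return l2.forward(h)

x = rng.normal(size=(1, 6))
samples = np.array([sample_reward(x) for _ in range(100)])
print(samples.mean(), samples.std())        # predictive mean and uncertainty
```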
Bayesian IRL, model free

MAP estimate:
$$\theta_{MAP} = \arg\max_\theta\; \mathbb{E}_{\tau\sim D}\!\left[\log p(\tau\mid\theta)\right] + \log p(\theta) \;:=\; \arg\max_\theta J(\theta)$$

$$\frac{\partial}{\partial\theta} J(\theta) = \mathbb{E}_{\tau\sim D}\!\left[\frac{\partial}{\partial\theta}\log p(\tau\mid\theta)\right] + \frac{\partial}{\partial\theta}\log p(\theta) = \mathbb{E}_{\tau\sim D}\!\left[\sum_{(s,a)\in\tau}\frac{\partial}{\partial\theta} r_\theta(s, a) - \frac{\partial}{\partial\theta}\log Z'\right] + \frac{\partial}{\partial\theta}\log p(\theta)$$

where (forward pass of the neural network; approximate the gradient with finite samples from background trajectories $\tau_s$):
$$\frac{\partial}{\partial\theta}\log Z' \;\rightarrow\; \sum_{(s,a)\in\tau_s}\frac{\partial}{\partial\theta} r_\theta(s, a)$$

Likelihood and prior:
$$P(D\mid r) = \prod_{\tau\in D}\prod_{(s,a)\in\tau} P\!\left(r(s, a)\right), \qquad P(r) = \prod_{(s,a)\in\tau} P\!\left(r(s, a)\right)$$

Min-hwan Oh et al., KDD '19, August 4–8, 2019, Anchorage, AK, USA
13
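A minimal sketch of the sample-based gradient above, assuming for simplicity a linear reward so that the reward gradient is just the feature vector, and a Gaussian prior on θ; the feature map and trajectories are hypothetical, and with a reward network the feature term would be replaced by the network's gradient.

```python
import numpy as np

def map_gradient(theta, phi, demo_pairs, background_pairs, prior_sigma=1.0):
    """Sample-based gradient of J(theta) = E_D[log p(tau|theta)] + log p(theta).

    Assumes a linear reward r_theta(s, a) = theta . phi(s, a), so that
    d r_theta / d theta = phi(s, a).

    phi: feature map, phi[s, a] -> feature vector
    demo_pairs: (s, a) pairs from demonstrated trajectories
    background_pairs: (s, a) pairs from trajectories generated by the current policy
    """
    # E_{tau~D} sum_{(s,a) in tau} d r_theta(s, a) / d theta
    grad_demo = np.mean([phi[s, a] for s, a in demo_pairs], axis=0)
    # d log Z' / d theta, approximated with background samples
    grad_bg = np.mean([phi[s, a] for s, a in background_pairs], axis=0)
    # d log p(theta) / d theta for a Gaussian prior N(0, sigma^2 I)
    grad_prior = -theta / prior_sigma**2
    return grad_demo - grad_bg + grad_prior

# Usage (hypothetical tabular feature map and trajectories):
phi = np.eye(6).reshape(3, 2, 6)        # phi[s, a] is a one-hot feature
theta = np.zeros(6)
demo = [(0, 1), (1, 1), (2, 0)]
bg = [(0, 0), (1, 0), (2, 1)]
theta += 0.1 * map_gradient(theta, phi, demo, bg)
```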
Bayesian IRL for the NAB dataset
RNN-based RL and Bayes IRL for anomaly detection (Chengqiang Huang et al., AAAI '18)

Rogue Agent Key Hold dataset: data size 1880, unit hold 0~0.89.
Temperature dataset: data size 7268, from 2013/07/04 0:00:00 to 2014/05/28 15:00:00, 60-minute intervals.
14
Inverse/reinforcement collaboration

Bayes IRL: expert trajectory → reward function.
Reinforcement learning: reward function + environment (MDP) → optimal policy.

$$RL(r) = \arg\max_{\pi\in\Pi}\; H(\pi) + \mathbb{E}_{\pi}\!\left[r(s, a)\right]$$
$$IRL(\pi_E) = \arg\max_{r\in\mathbb{R}^{S\times A}}\; \mathbb{E}_{\pi_E}\!\left[r(s, a)\right] - \left(\max_{\pi\in\Pi}\; H(\pi) + \mathbb{E}_{\pi}\!\left[r(s, a)\right]\right)$$
samyzaf.com/
15
Inverse/reinforcement collaboration (combined loop)

Bayes IRL: expert trajectory → reward function; Reinforcement learning: reward function + environment (MDP) → optimal policy; the expert trajectory comes from the environment.

$$RL(r) = \arg\max_{\pi\in\Pi}\; H(\pi) + \mathbb{E}_{\pi}\!\left[r(s, a)\right]$$
$$IRL(\pi_E) = \arg\max_{r\in\mathbb{R}^{S\times A}}\; \mathbb{E}_{\pi_E}\!\left[r(s, a)\right] - \left(\max_{\pi\in\Pi}\; H(\pi) + \mathbb{E}_{\pi}\!\left[r(s, a)\right]\right)$$
samyzaf.com/
16
Inverse/reinforcement based learning: interactive maze environment
Legend: Start, Target, Movable place, Restricted place.
17
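A minimal sketch of an interactive grid-maze environment with start, target, movable, and restricted cells; the grid layout, action set, and reward values are illustrative, not the exact environment used in the experiments.

```python
import numpy as np

class MazeEnv:
    """Tiny grid maze: 0 = movable place, 1 = restricted place."""
    def __init__(self):
        self.grid = np.array([[0, 0, 0, 1],
                              [1, 1, 0, 1],
                              [0, 0, 0, 0],
                              [0, 1, 1, 0]])
        self.start, self.target = (0, 0), (3, 3)
        self.pos = self.start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right.
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        r, c = self.pos
        dr, dc = moves[action]
        nr, nc = r + dr, c + dc
        # Stay in place if the move leaves the grid or enters a restricted cell.
        if 0 <= nr < 4 and 0 <= nc < 4 and self.grid[nr, nc] == 0:
            self.pos = (nr, nc)
        done = self.pos == self.target
        reward = 0.0 if done else -0.001     # small step penalty until the target
        return self.pos, reward, done

env = MazeEnv()
state = env.reset()
state, reward, done = env.step(3)    # try to move right
```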
IRL learning-time reward curve
[Figure: reward vs. iteration for three training settings: best trajectories only, including bad trajectories, and a smaller number of trajectories.]
18
Intentional or goal-oriented detection
[Figure: reward per step (range -0.05 to 0) vs. steps to target, for trajectories taking 16, 26, 32, 38, and 44 steps to reach the target.]
19
Bayes-IRL: trajectory anomaly detection
[Figure: reward per step (range -0.05 to 0) vs. steps to target for the normal trajectory (16 steps) and test trajectories 1-4 (26, 32, 38, and 44 steps to the target).]
Summary
Reward calculation using model-free IRL is one possible way to calculate the reward.
Teacher data are needed for IRL, and correct data help to produce a good reward, but it is very difficult to know which trajectories are good and which are bad.
Bayesian-NN-based IRL is a good choice for small datasets, because a plain NN tends to overfit on small datasets.

Future work
So far we have implemented IRL only for a small interactive environment, and we do not yet have a clear picture of which teacher data and which Bayesian-neural-network parameters best maximize the likelihood. We will investigate these points in more detail within the same line of work.
20

Editor's Notes

  • #3 This is the outline of the presentation. First there is a brief introduction to RL, then a brief discussion of reward calculation methods. I will then talk about maximum entropy, and after that the Bayes method. Finally I summarize the presentation.
  • #4 Based on these objectives, we have worked as follows. We use a Bayesian neural network for inverse reinforcement learning because it reduces the overfitting problem on small datasets; expert data are very expensive and only available in small numbers. We also use reinforcement learning for intentional or goal-oriented anomaly detection, which is very difficult with ordinary machine learning. The main objective is to use Bayesian inverse reinforcement learning for intentional anomaly detection.
  • #5 Before introducing inverse reinforcement learning, a very brief introduction to standard reinforcement learning. We usually work with a Markov decision process (MDP). An MDP is defined by the state space S, the action space A, the model P that describes the transition dynamics of the environment, the reward function r, the discount factor gamma, and the starting state distribution. The goal of RL is to find a policy that maximizes the discounted return V; sometimes we use the action-value function Q, the expected return when selecting action a in state s and then following policy pi. The key point is that when all components of the MDP are known, we can simply use dynamic programming techniques such as value iteration or policy iteration to maximize the reward.
  • #6 Often, though, we do not know the real dynamics P, but we can interact with the environment. One popular way to learn a policy in that case is model-free reinforcement learning. Many methods exist; here I mention just one, in which we parameterize the policy by a set of weights theta and optimize the value function by taking the gradient with respect to theta. The gradient of V with respect to theta is given by the policy gradient theorem, which transforms the gradient of V into an expression that depends only on the policy and not on the unknown dynamics.
  • #7 Now let us return to our goal. The first thing to note is that reinforcement learning typically assumes the reward function is given, as in video games, Gym environments, and so on, where an expert has already defined the reward and the environment dynamics. When we move beyond games, however, neither the environment nor the reward is clearly defined; almost all real-world problems lack a clearly defined model and reward, so we need reward engineering. There are many examples of unintended consequences from manually crafting reward functions. Here we discuss methods that help to define the reward.
  • #8 Before moving to IRL, a brief introduction to reward-defining methods. There are mainly three types of reward engineering. The first is behavioral cloning, which relies on a collected set of expert data and applies supervised learning to minimize an imitation loss; here the loss function is, for example, the mean squared error. In direct policy learning, we apply supervised learning to learn a policy, roll out that policy in the environment, and then the expert provides feedback on the rollout trajectories to create more training data. Inverse RL learns the reward instead of the policy; the assumption is that learning the reward is statistically easier than directly learning the policy.
  • #9 Here A is general imitation learning and B is behavioral cloning, which is standard supervised learning; behavioral cloning is the 1-step special case of direct policy learning, so it can be seen as a reduced form of imitation learning. To summarize the reward engineering methods: behavioral cloning uses pre-collected demonstrations and is a direct-policy-learning technique; direct policy learning needs access to the environment, since it interacts with the environment to obtain new data, and pre-collected data are optional; IRL needs access to the environment, focuses mainly on reward learning, and also needs pre-collected demonstrations to design the right reward. IRL is mainly used in the model-free setting.
  • #10 So far we have talked about the need for IRL and methods for imitation. How do we define the reward when many reward functions correspond to the same policy? Maximum entropy gives us a principled way to resolve this ambiguity: choose the largest-entropy distribution so as not to over-commit. We can express this principle as a constrained optimization problem; the first constraint is that the feature sum of trajectories induced by policy pi should match the expert feature expectations, and the second is that the distribution over trajectories induced by pi should sum to 1. Ziebart et al., AAAI '08 converted the linear constraints into the Lagrangian form; here Z(theta) is the sum over all possible trajectories.
  • #11 We can learn a linear reward by maximizing the log likelihood of the observed trajectories. Taking the log breaks the product into a sum of conditional probabilities of the expert trajectories under the parameter theta, and we use gradient-based optimization to maximize the log likelihood. In the model-given, linear-reward case the gradient can be expressed as shown: the first term is the feature average over the demonstrated trajectories and the second is the expected features under the current policy pi. The authors suggested dynamic programming to compute the state visitation frequencies under policy pi; theta is the parameter.
  • #12 So far we have focused on a fairly simple setting: we assumed the reward is linear in some known features. For real problems this essentially requires knowing the dynamics of the MDP, which is not always realistic, and solving RL in the inner loop can be very expensive. So now let us discuss the generalized setting where we know neither the reward nor the dynamics. Consider a complex reward function represented by a neural network; we assume we do not have access to the full dynamics but can interact with the environment or a simulator, and we try to scale up to large state spaces. This is what we do next.
  • #13 Recall that maximum entropy says a trajectory with high reward is exponentially more likely to be sampled than a trajectory with low reward. With reward parameter theta, the conditional probability of a trajectory has an exponential form divided by the partition function, which is a sum over all possible trajectories. After rewriting Bayes' theorem in exponential form this looks similar, and here we encounter the prior over the reward, the basic Bayesian term. This linear-reward formulation was presented by Deepak Ramachandran et al., IJCAI '07, who also discussed how to handle the normalizing constant in the Bayesian setting.
  • #14 For our work we used a Bayesian neural network. A neural network can be viewed as a probabilistic model: each weight is assigned a distribution, and the prior is assumed Gaussian. To train the network we need a dataset of inputs and targets, and we construct the likelihood from the data and the weights w; the posterior distribution is proportional to the likelihood multiplied by the prior. Our main goal is maximizing this likelihood. The Bayesian neural network method was published by DeepMind in 2015 under the name "Weight Uncertainty in Neural Networks" to address overfitting in neural networks.
  • #15 The main idea of Bayesian IRL (BIRL) is to use a prior to encode the reward preference and to formulate compatibility with the demonstrator's policy as a likelihood, in order to derive a probability distribution over the space of reward functions. The (log unnormalized) posterior over the reward function is then formed by combining the prior and the likelihood. We update the parameters by a gradient method, where tau is a set of demonstrated trajectories from the target agent and tau_s is a set of background trajectories generated by the policy pi; using the target and generated trajectories we update the parameters.
  • #16 This is an implementation of Bayesian IRL with an RNN; the RNN was presented at AAAI 2018 for anomaly detection. For the anomaly detection experiments we used the NAB dataset, calculated the reward from it, and used the reward during RNN training. In the results, the first plot for the rogue-agent dataset shows the anomaly correctly detected, with a clear difference between the anomalous data and the rest. On the temperature dataset, Bayesian IRL and RL do not give a clear detection; the reason may be that all the data look similar, because IRL mainly focuses on the data used as the target.
  • #17 This is our work on combining model-free IRL and RL; the structure is shown in this flowchart. There are mainly three parts: RL, the MDP, and IRL. RL always works to maximize the total reward, while IRL works to create the optimal reward as the environment changes, using expert trajectories to obtain the correct reward.
  • #18 After combining IRL and RL we obtain the structure shown in this flowchart. RL uses the reward created by IRL, while IRL needs the environment state and the RL action to maximize the log-likelihood reward for each state-action combination.
  • #19 We built an interactive maze environment, which makes it easy to collect different state-action trajectories. We used the different trajectories shown on the slide as target trajectories; from them, IRL calculates the reward that RL then uses to learn the path from the start to the target point in the maze.
  • #20 The top plot shows the reward progress when the IRL reward is given to RL and RL interacts with the environment. The high blue curve uses all the best trajectories shown on the previous slide, the curve near zero is the reward obtained using bad trajectories, and the middle gray curve uses a smaller number of trajectories during learning. The animation on the left keeps the same starting state as during learning and shows the agent's action at each state on the way to the goal; the animation on the right changes the starting condition of the maze and tests the agent's actions toward the target. The agent is able to reach the target every time.
  • #21 We are near the end; our goal was to detect intentional anomalies, and here is the result. The graph shows the total reward for different numbers of actions needed to reach the target. The 16-step curve is the normal reward curve, and up to 32 steps the reward curves keep a similar shape, but beyond 32 steps the total reward decreases dramatically, and the same behaviour can be observed from the actions the agent takes in the maze environment. In short, if the agent intentionally takes a long route, IRL reveals that intention through its reward.
  • #23 To summarize our work: we used Bayesian IRL for reward calculation, which is one possible way to calculate the reward; it is particularly useful for small sets of trajectories, where overfitting is a risk. We used IRL-reward-based learning both on datasets and in an interactive environment. So far we have used only a small interactive environment; for future work, we are not yet confident that the method works for large interactive environments, so we want to continue this work.
  • #23 By summarizing our work we used bayes IRL for reward calculation it is one possible way to calculate reward, it is very fruitful for small trajectory because there is possibility of overfitting. We used IRL reward based learning in the data based learning as well as interactive environment. So far we are using small interactive environment. For future work we are not confident does it works for the big interactive environment so further more we want to continue the work.