Development of Anomaly Prediction Detection Method Using Bayesian Inverse Reinforcement Learning
-PERC
○Dinesh B. Malla*1,3 (M2), Tomah Sogabe1,2,3, Masaru Sogabe3, Tomoaki Kimura1, Katsuyoshi Sakamoto1,2, Koichi Yamaguchi1,2
*1 Info-Powered Energy System Research Center, *2 Department of Engineering Science, The University of Electro-Communications
*3 Technology Solution Group, Grid Inc.
Outline
Introduction: Reinforcement learning (RL), Inverse reinforcement learning (IRL)
Reward calculation
Max entropy method
Bayes method
Summary
1
Objectives
Bayes neural network for inverse reinforcement learning.
Reinforcement learning for intentional or goal-oriented anomaly detection.
2
MDP formulation: $\langle S, A, P, r, \gamma, P_0 \rangle$

$$\max_{\pi\in\Pi} V(\pi) \;\triangleq\; \max_{\pi\in\Pi} \mathbb{E}_{\pi}\!\left[r(s,a)\right] \;=\; \max_{\pi\in\Pi} \mathbb{E}_{s_0\sim p_0}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t, a_t)\,\middle|\,\pi\right]$$

S: state, A: action, P: transition model, r: reward function, $\gamma$: discount factor, $P_0$: starting-state distribution

Goal: maximize cumulative rewards.
Fully specified MDP: value iteration & policy iteration.
RL for a fully specified MDP
3
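To make the fully specified case concrete, here is a minimal value-iteration sketch for a small tabular MDP; the transition tensor and reward array are made-up placeholders, not values from the presentation.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-6):
    """Value iteration for a fully specified tabular MDP.

    P: transition tensor of shape (S, A, S), P[s, a, s'] = P(s'|s, a)
    r: reward array of shape (S, A)
    Returns the optimal value function V and a greedy policy.
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = r + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [0.5, 0.0]])
V, pi = value_iteration(P, r)
print(V, pi)
```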
Model-free RL
MDP formulation: $\langle S, A, P, r, \gamma, P_0 \rangle$, where the dynamics model $P$ is unknown.

$$V(\pi_\theta) = \mathbb{E}_{s_0\sim p_0}\!\left[\text{cumulative rewards} \mid \pi_\theta\right]$$

Policy Gradient Theorem:
$$\nabla_\theta V(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$$

Update rule:
$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, Q^{\pi_\theta}(s_t, a_t)$$

Policy gradient methods model and optimize the policy directly; the policy is usually modeled with a function $\pi_\theta(a|s)$ parameterized by $\theta$.
Sutton et al., ICML 1999; http://christinemcleavey.com/world-cup-reinforcement-learning/
4
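A minimal sketch of this update rule, assuming a tabular softmax policy and using the Monte Carlo return as the Q estimate (REINFORCE); the toy episode is illustrative only.

```python
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.99
theta = np.zeros((n_states, n_actions))     # tabular policy parameters

def pi(s):
    """Softmax policy pi_theta(.|s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """Gradient of log pi_theta(a|s) w.r.t. theta (nonzero only in row s)."""
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

# One hypothetical episode of (state, action, reward) tuples.
episode = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]

# REINFORCE: use the Monte Carlo return G_t as the estimate of Q(s_t, a_t).
G = 0.0
for s, a, r in reversed(episode):
    G = r + gamma * G
    theta += alpha * grad_log_pi(s, a) * G
```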
Reward engineering
Model-given (small search-space environment): $\langle S, A, P, r, \gamma, P_0 \rangle$ where the right reward function $r$ is unknown.
Real-world problem (large search space): $\langle S, A, P, r, \gamma, P_0 \rangle$ where both the dynamics $P$ and the right reward function $r$ are unknown.
Reward-engineering loop: Run RL → Compare with expert → Update reward.
5
Inverse RL: learn $r$ such that
$$\pi^* = \arg\max_\theta\; \mathbb{E}_{s\sim P(s|\theta)}\!\left[r\!\left(s, \pi_\theta(s)\right)\right]$$
Assumes learning $r$ is statistically easier than directly learning $\pi^*$.

Direct policy learning (via an interactive demonstrator): Collect demonstrations → Supervised learning → Roll out in the environment. Requires an interactive demonstrator (behavioral cloning is the 1-step special case).

Behavioral cloning: works well when $P^*$ is close to $P_\theta$. Training loss:
$$\arg\min_\theta\; \mathbb{E}_{(s, a^*)\sim P^*}\!\left[L\!\left(a^*, \pi_\theta(s)\right)\right]$$

Reward engineering
"Robots Learning to Move like Animals": https://bair.berkeley.edu/blog/2020/04/03/laikago/
6
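For contrast, a minimal behavioral-cloning sketch that minimizes the imitation loss above with a linear softmax policy; the demonstration arrays and model shape are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical demonstrations: states (N, d) and expert action indices (N,).
states = rng.normal(size=(256, 4))
expert_actions = rng.integers(0, 3, size=256)

# Linear softmax policy pi_theta(a|s); theta has shape (d, n_actions).
theta = np.zeros((4, 3))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

lr = 0.1
for _ in range(200):
    probs = softmax(states @ theta)          # (N, n_actions)
    onehot = np.eye(3)[expert_actions]       # (N, n_actions)
    # Gradient of the cross-entropy imitation loss L(a*, pi_theta(s)).
    grad = states.T @ (probs - onehot) / len(states)
    theta -= lr * grad
```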
Summarization of reward engineering

Method                                  | Direct policy learning | Reward learning | Access to environment | Interactive demonstrator | Pre-collected demonstrations
Behavioral cloning                      | Yes                    | No              | No                    | No                       | Yes
Direct policy learning (interactive IL) | Yes                    | No              | Yes                   | Yes                      | Optional
Inverse reinforcement learning          | No                     | Yes             | Yes                   | No                       | Yes

Learning reductions: what does learning well on B (behavioral cloning) imply about A (imitation learning)?
7
Maximum entropy IRL

Maximum entropy principle (E. T. Jaynes, 1957): the probability distribution that best represents the current state of knowledge is the one with the largest entropy.

A policy induces a distribution over trajectories, but:
*Many reward functions correspond to the same policy.
*Many stochastic mixtures of policies correspond to the same feature expectations.

Linear reward: $r(s_t) = \theta^{T}\phi(s_t)$. Converting the linear constraints to the Lagrangian form (Ziebart et al., AAAI '08) gives the assumed trajectory distribution:
$$P(\tau\mid\theta) = \frac{1}{Z(\theta)}\, e^{\sum_{s_t\in\tau}\theta^{T}\phi(s_t)}$$
8
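A minimal sketch of this trajectory distribution over a small enumerable set of trajectories, assuming a hypothetical one-hot feature map; it only illustrates that higher-reward trajectories become exponentially more likely.

```python
import numpy as np

# Hypothetical feature map phi(s) for 4 states and a linear reward r(s) = theta . phi(s).
phi = np.eye(4)
theta = np.array([0.0, 0.5, 1.0, 2.0])

# A small enumerable set of candidate trajectories (state sequences).
trajectories = [[0, 1, 3], [0, 2, 3], [0, 1, 2, 3]]

# Unnormalized weight of each trajectory: exp(sum_t theta . phi(s_t)).
weights = np.array([np.exp(sum(theta @ phi[s] for s in traj)) for traj in trajectories])
Z = weights.sum()            # partition function over the enumerated set
probs = weights / Z          # P(tau | theta)
print(probs)                 # higher-reward trajectories are exponentially more likely
```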
Maximum entropy IRL learning (Ziebart et al., AAAI '08)

Reward inference: maximize the log likelihood of the demonstrations $D$:
$$\theta^{*} = \arg\max_\theta L(\theta) = \arg\max_\theta \sum_{\tau_i\in D}\log P(\tau_i\mid\theta)$$

Gradient ascent on the log likelihood:
$$\nabla_\theta L(\theta) = \frac{1}{m}\sum_{\tau_i\in D}\mu(\tau_i) \;-\; \sum_{s} d_{s}^{\theta}\,\phi(s), \qquad \mu(\tau_i) = \frac{1}{T}\sum_{s'\in\tau_i}\phi(s')$$

Dynamic programming for the state occupancy measure (visitation frequency):
$$d_{t+1,s'} = \sum_{a}\sum_{s} d_{t,s}\,\pi_\theta(a\mid s)\, P(s'\mid s, a)$$
9
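A minimal sketch of this gradient, assuming a tabular MDP with a known transition tensor, a fixed policy, and a fixed start state; the feature map, dynamics, and trajectories in the usage example are illustrative.

```python
import numpy as np

def maxent_irl_gradient(phi, P, pi, expert_trajs, T):
    """Gradient of the MaxEnt IRL log likelihood for a linear reward r(s) = theta^T phi(s).

    phi: feature matrix (S, k)
    P: transition tensor (S, A, S)
    pi: policy (S, A), pi[s, a] = pi_theta(a|s)
    expert_trajs: list of state sequences
    T: horizon used for the visitation-frequency recursion
    """
    n_states, k = phi.shape

    # Expert feature expectations: mu(tau) = (1/T) * sum of phi over visited states.
    mu = np.zeros(k)
    for traj in expert_trajs:
        mu += phi[traj].sum(axis=0) / len(traj)
    mu /= len(expert_trajs)

    # State visitation frequencies: d_{t+1, s'} = sum_{a, s} d_{t, s} pi(a|s) P(s'|s, a).
    d = np.zeros(n_states)
    d[0] = 1.0                     # assume a fixed start state 0
    d_total = d.copy()
    for _ in range(T - 1):
        d = np.einsum('s,sa,saq->q', d, pi, P)
        d_total += d
    d_total /= T

    # grad L(theta) = mu - sum_s d_s phi(s)
    return mu - d_total @ phi

# Usage with a toy 3-state, 2-action setup (illustrative numbers only):
phi = np.eye(3)
P = np.full((3, 2, 3), 1.0 / 3)
pi = np.full((3, 2), 0.5)
expert_trajs = [[0, 1, 2], [0, 2, 2]]
print(maxent_irl_gradient(phi, P, pi, expert_trajs, T=3))
```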
IRL + model-free optimization

Model-given: $\langle S, A, P, r, \gamma, P_0 \rangle$ with $r$ unknown; small, given model; no interaction with the environment.
Model-free: $\langle S, A, P, r, \gamma, P_0 \rangle$ with both $P$ and $r$ unknown; interact with the environment / simulator; large continuous space.
Loop: the agent exchanges state, action, and reward with the environment; Run RL → Compare with expert → Update reward.
10
Bayesian IRL concept

Recall the MaxEnt formulation with a linear reward $r(s\mid\theta) = \theta^{T}\phi(s)$:
$$P(\tau\mid\theta) = \frac{1}{Z(\theta)}\, e^{\sum_{s_t\in\tau}\theta^{T}\phi(s_t)}, \qquad Z(\theta) = \int e^{r(\tau\mid\theta)}\, d\tau$$
Here the reward $r_\theta(s_t, a_t)$ can instead be parameterized by a neural network (the agent's reward parameters).

Applying Bayes' theorem to the reward function R: "Computing the normalizing constant Z' is hard. However, the sampling algorithms we will use for inference only need the ratios of the densities at two points, so this is not a problem." (Deepak Ramachandran et al., IJCAI '07)
11
Bayesian IRL using a neural network (Bayes neural network)
Google DeepMind 2015, "Weight Uncertainty in Neural Networks"

The neural network is treated as a probabilistic model $P(y\mid x, w)$; each weight is assigned a distribution.
Input: (state, action); output: reward.
The prior distribution over the weights is assumed Gaussian $(\mu, \sigma^2)$.

Training dataset $\mathcal{D} = \{x^{(i)}, y^{(i)}\}$. Construct the likelihood function
$$p(\mathcal{D}\mid w) = \prod_i p\!\left(y^{(i)}\mid x^{(i)}, w\right)$$
Posterior distribution:
$$p(w\mid\mathcal{D}) \propto p(\mathcal{D}\mid w)\, p(w)$$
12
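A minimal sketch of the "each weight is a distribution" idea: every weight has a Gaussian (µ, σ²) and each forward pass samples the weights, so repeated passes give a predictive distribution over the reward. The layer sizes and the softplus reparameterization are assumptions, not the exact network used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

class BayesLinear:
    """Linear layer whose weights are Gaussian: w ~ N(mu, sigma^2)."""
    def __init__(self, n_in, n_out):
        self.mu = rng.normal(scale=0.1, size=(n_in, n_out))
        self.rho = np.full((n_in, n_out), -3.0)     # sigma = softplus(rho)

    def forward(self, x):
        sigma = np.log1p(np.exp(self.rho))
        w = self.mu + sigma * rng.normal(size=self.mu.shape)    # sample weights
        return x @ w

# Reward network: input is a (state, action) feature vector, output is a scalar reward.
l1, l2 = BayesLinear(6, 16), BayesLinear(16, 1)

def sample_reward(state_action):
    h = np.tanh(l1.forward(state_action))
    return l2.forward(h)

x = rng.normal(size=(1, 6))
samples = np.array([sample_reward(x) for _ in range(100)])
print(samples.mean(), samples.std())        # predictive mean and uncertainty
```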
Bayesian IRL, model free

MAP estimate:
$$\theta_{MAP} = \arg\max_\theta\; \mathbb{E}_{\tau\sim D}\!\left[\log p(\tau\mid\theta)\right] + \log p(\theta) \;:=\; \arg\max_\theta J(\theta)$$

$$\frac{\partial}{\partial\theta} J(\theta) = \mathbb{E}_{\tau\sim D}\!\left[\frac{\partial}{\partial\theta}\log p(\tau\mid\theta)\right] + \frac{\partial}{\partial\theta}\log p(\theta) = \mathbb{E}_{\tau\sim D}\!\left[\sum_{(s,a)\in\tau}\frac{\partial}{\partial\theta} r_\theta(s, a) - \frac{\partial}{\partial\theta}\log Z'\right] + \frac{\partial}{\partial\theta}\log p(\theta)$$

where (forward pass of the neural network; approximate the gradient with finite samples from background trajectories $\tau_s$):
$$\frac{\partial}{\partial\theta}\log Z' \;\rightarrow\; \sum_{(s,a)\in\tau_s}\frac{\partial}{\partial\theta} r_\theta(s, a)$$

Likelihood and prior:
$$P(D\mid r) = \prod_{\tau\in D}\prod_{(s,a)\in\tau} P\!\left(r(s, a)\right), \qquad P(r) = \prod_{(s,a)\in\tau} P\!\left(r(s, a)\right)$$

Min-hwan Oh et al., KDD '19, August 4–8, 2019, Anchorage, AK, USA
13
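A minimal sketch of the sample-based gradient above, assuming for simplicity a linear reward so that the reward gradient is just the feature vector, and a Gaussian prior on θ; the feature map and trajectories are hypothetical, and with a reward network the feature term would be replaced by the network's gradient.

```python
import numpy as np

def map_gradient(theta, phi, demo_pairs, background_pairs, prior_sigma=1.0):
    """Sample-based gradient of J(theta) = E_D[log p(tau|theta)] + log p(theta).

    Assumes a linear reward r_theta(s, a) = theta . phi(s, a), so that
    d r_theta / d theta = phi(s, a).

    phi: feature map, phi[s, a] -> feature vector
    demo_pairs: (s, a) pairs from demonstrated trajectories
    background_pairs: (s, a) pairs from trajectories generated by the current policy
    """
    # E_{tau~D} sum_{(s,a) in tau} d r_theta(s, a) / d theta
    grad_demo = np.mean([phi[s, a] for s, a in demo_pairs], axis=0)
    # d log Z' / d theta, approximated with background samples
    grad_bg = np.mean([phi[s, a] for s, a in background_pairs], axis=0)
    # d log p(theta) / d theta for a Gaussian prior N(0, sigma^2 I)
    grad_prior = -theta / prior_sigma**2
    return grad_demo - grad_bg + grad_prior

# Usage (hypothetical tabular feature map and trajectories):
phi = np.eye(6).reshape(3, 2, 6)        # phi[s, a] is a one-hot feature
theta = np.zeros(6)
demo = [(0, 1), (1, 1), (2, 0)]
bg = [(0, 0), (1, 0), (2, 1)]
theta += 0.1 * map_gradient(theta, phi, demo, bg)
```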
Bayesian IRL for the NAB dataset
RNN-based RL and Bayes IRL for anomaly detection (Chengqiang Huang et al., AAAI '18)

Rogue Agent Key Hold dataset: data size 1880, unit hold 0~0.89.
Temperature dataset: data size 7268, from 2013/07/04 0:00:00 to 2014/05/28 15:00:00, 60-minute intervals.
14
Inverse/reinforcement collaboration

Bayes IRL: expert trajectory → reward function.
Reinforcement learning: reward function + environment (MDP) → optimal policy.

$$RL(r) = \arg\max_{\pi\in\Pi}\; H(\pi) + \mathbb{E}_{\pi}\!\left[r(s, a)\right]$$
$$IRL(\pi_E) = \arg\max_{r\in\mathbb{R}^{S\times A}}\; \mathbb{E}_{\pi_E}\!\left[r(s, a)\right] - \left(\max_{\pi\in\Pi}\; H(\pi) + \mathbb{E}_{\pi}\!\left[r(s, a)\right]\right)$$
samyzaf.com/
15
Inverse/reinforcement collaboration (combined loop)

Bayes IRL: expert trajectory → reward function; Reinforcement learning: reward function + environment (MDP) → optimal policy; the expert trajectory comes from the environment.

$$RL(r) = \arg\max_{\pi\in\Pi}\; H(\pi) + \mathbb{E}_{\pi}\!\left[r(s, a)\right]$$
$$IRL(\pi_E) = \arg\max_{r\in\mathbb{R}^{S\times A}}\; \mathbb{E}_{\pi_E}\!\left[r(s, a)\right] - \left(\max_{\pi\in\Pi}\; H(\pi) + \mathbb{E}_{\pi}\!\left[r(s, a)\right]\right)$$
samyzaf.com/
16
Inverse/reinforcement based learning: interactive maze environment
Legend: Start, Target, Movable place, Restricted place.
17
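A minimal sketch of an interactive grid-maze environment with start, target, movable, and restricted cells; the grid layout, action set, and reward values are illustrative, not the exact environment used in the experiments.

```python
import numpy as np

class MazeEnv:
    """Tiny grid maze: 0 = movable place, 1 = restricted place."""
    def __init__(self):
        self.grid = np.array([[0, 0, 0, 1],
                              [1, 1, 0, 1],
                              [0, 0, 0, 0],
                              [0, 1, 1, 0]])
        self.start, self.target = (0, 0), (3, 3)
        self.pos = self.start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right.
        moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        r, c = self.pos
        dr, dc = moves[action]
        nr, nc = r + dr, c + dc
        # Stay in place if the move leaves the grid or enters a restricted cell.
        if 0 <= nr < 4 and 0 <= nc < 4 and self.grid[nr, nc] == 0:
            self.pos = (nr, nc)
        done = self.pos == self.target
        reward = 0.0 if done else -0.001     # small step penalty until the target
        return self.pos, reward, done

env = MazeEnv()
state = env.reset()
state, reward, done = env.step(3)    # try to move right
```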
IRL learning-time reward curve
[Figure: reward vs. iteration for three training settings: best trajectories only, including bad trajectories, and a smaller number of trajectories.]
18
Intentional or goal-oriented detection
[Figure: reward per step (range -0.05 to 0) vs. steps to target, for trajectories taking 16, 26, 32, 38, and 44 steps to reach the target.]
19
Bayes-IRL: trajectory anomaly detection
[Figure: reward per step (range -0.05 to 0) vs. steps to target for the normal trajectory (16 steps) and test trajectories 1-4 (26, 32, 38, and 44 steps to the target).]
Summary
Reward calculation using model-free IRL is one possible way to calculate the reward.
Teacher data are needed for IRL, and correct data help to produce a good reward, but it is very difficult to know which trajectories are good and which are bad.
Bayesian-NN-based IRL is a good choice for small datasets, because a plain NN tends to overfit on small datasets.

Future work
So far we have implemented IRL only for a small interactive environment, and we do not yet have a clear picture of which teacher data and which Bayesian-neural-network parameters best maximize the likelihood. We will investigate these points in more detail within the same line of work.
20

Editor's Notes

  • #3 This is the outline of the presentation. First there is a brief introduction to RL, then a brief discussion of reward calculation methods. I will then talk about maximum entropy, and after that the Bayes method. Finally I summarize the presentation.
  • #4 Based on these objectives, we have worked as follows. We use a Bayesian neural network for inverse reinforcement learning because it reduces the overfitting problem on small datasets; expert data are very expensive and only available in small numbers. We also use reinforcement learning for intentional or goal-oriented anomaly detection, which is very difficult with ordinary machine learning. The main objective is to use Bayesian inverse reinforcement learning for intentional anomaly detection.
  • #5 Before introducing inverse reinforcement learning, a very brief introduction to standard reinforcement learning. We usually work with a Markov decision process (MDP). An MDP is defined by the state space S, the action space A, the model P that describes the transition dynamics of the environment, the reward function r, the discount factor gamma, and the starting state distribution. The goal of RL is to find a policy that maximizes the discounted return V; sometimes we use the action-value function Q, the expected return when selecting action a in state s and then following policy pi. The key point is that when all components of the MDP are known, we can simply use dynamic programming techniques such as value iteration or policy iteration to maximize the reward.
  • #6 Often, though, we do not know the real dynamics P, but we can interact with the environment. One popular way to learn a policy in that case is model-free reinforcement learning. Many methods exist; here I mention just one, in which we parameterize the policy by a set of weights theta and optimize the value function by taking the gradient with respect to theta. The gradient of V with respect to theta is given by the policy gradient theorem, which transforms the gradient of V into an expression that depends only on the policy and not on the unknown dynamics.
  • #7 Now let us return to our goal. The first thing to note is that reinforcement learning typically assumes the reward function is given, as in video games, Gym environments, and so on, where an expert has already defined the reward and the environment dynamics. When we move beyond games, however, neither the environment nor the reward is clearly defined; almost all real-world problems lack a clearly defined model and reward, so we need reward engineering. There are many examples of unintended consequences from manually crafting reward functions. Here we discuss methods that help to define the reward.
  • #8 Before moving to IRL, a brief introduction to reward-defining methods. There are mainly three types of reward engineering. The first is behavioral cloning, which relies on a collected set of expert data and applies supervised learning to minimize an imitation loss; here the loss function is, for example, the mean squared error. In direct policy learning, we apply supervised learning to learn a policy, roll out that policy in the environment, and then the expert provides feedback on the rollout trajectories to create more training data. Inverse RL learns the reward instead of the policy; the assumption is that learning the reward is statistically easier than directly learning the policy.
  • #9 Here A is general imitation learning and B is behavioral cloning, which is standard supervised learning; behavioral cloning is the 1-step special case of direct policy learning, so it can be seen as a reduced form of imitation learning. To summarize the reward engineering methods: behavioral cloning uses pre-collected demonstrations and is a direct-policy-learning technique; direct policy learning needs access to the environment, since it interacts with the environment to obtain new data, and pre-collected data are optional; IRL needs access to the environment, focuses mainly on reward learning, and also needs pre-collected demonstrations to design the right reward. IRL is mainly used in the model-free setting.
  • #10 So far we have talked about the need for IRL and methods for imitation. How do we define the reward when many reward functions correspond to the same policy? Maximum entropy gives us a principled way to resolve this ambiguity: choose the largest-entropy distribution so as not to over-commit. We can express this principle as a constrained optimization problem; the first constraint is that the feature sum of trajectories induced by policy pi should match the expert feature expectations, and the second is that the distribution over trajectories induced by pi should sum to 1. Ziebart et al., AAAI '08 converted the linear constraints into the Lagrangian form; here Z(theta) is the sum over all possible trajectories.
  • #11 We can learn a linear reward by maximizing the log likelihood of the observed trajectories. Taking the log breaks the product into a sum of conditional probabilities of the expert trajectories under the parameter theta, and we use gradient-based optimization to maximize the log likelihood. In the model-given, linear-reward case the gradient can be expressed as shown: the first term is the feature average over the demonstrated trajectories and the second is the expected features under the current policy pi. The authors suggested dynamic programming to compute the state visitation frequencies under policy pi; theta is the parameter.
  • #12 So far we have focused on a fairly simple setting: we assumed the reward is linear in some known features. For real problems this essentially requires knowing the dynamics of the MDP, which is not always realistic, and solving RL in the inner loop can be very expensive. So now let us discuss the generalized setting where we know neither the reward nor the dynamics. Consider a complex reward function represented by a neural network; we assume we do not have access to the full dynamics but can interact with the environment or a simulator, and we try to scale up to large state spaces. This is what we do next.
  • #13 Recall that maximum entropy says a trajectory with high reward is exponentially more likely to be sampled than a trajectory with low reward. With reward parameter theta, the conditional probability of a trajectory has an exponential form divided by the partition function, which is a sum over all possible trajectories. After rewriting Bayes' theorem in exponential form this looks similar, and here we encounter the prior over the reward, the basic Bayesian term. This linear-reward formulation was presented by Deepak Ramachandran et al., IJCAI '07, who also discussed how to handle the normalizing constant in the Bayesian setting.
  • #14 For our work we used a Bayesian neural network. A neural network can be viewed as a probabilistic model: each weight is assigned a distribution, and the prior is assumed Gaussian. To train the network we need a dataset of inputs and targets, and we construct the likelihood from the data and the weights w; the posterior distribution is proportional to the likelihood multiplied by the prior. Our main goal is maximizing this likelihood. The Bayesian neural network method was published by DeepMind in 2015 under the name "Weight Uncertainty in Neural Networks" to address overfitting in neural networks.
  • #15 The main idea of Bayesian IRL (BIRL) is to use a prior to encode the reward preference and to formulate compatibility with the demonstrator's policy as a likelihood, in order to derive a probability distribution over the space of reward functions. The (log unnormalized) posterior over the reward function is then formed by combining the prior and the likelihood. We update the parameters by a gradient method, where tau is a set of demonstrated trajectories from the target agent and tau_s is a set of background trajectories generated by the policy pi; using the target and generated trajectories we update the parameters.
  • #16 This is an implementation of Bayesian IRL with an RNN; the RNN was presented at AAAI 2018 for anomaly detection. For the anomaly detection experiments we used the NAB dataset, calculated the reward from it, and used the reward during RNN training. In the results, the first plot for the rogue-agent dataset shows the anomaly correctly detected, with a clear difference between the anomalous data and the rest. On the temperature dataset, Bayesian IRL and RL do not give a clear detection; the reason may be that all the data look similar, because IRL mainly focuses on the data used as the target.
  • #17 This is our work on combining model-free IRL and RL; the structure is shown in this flowchart. There are mainly three parts: RL, the MDP, and IRL. RL always works to maximize the total reward, while IRL works to create the optimal reward as the environment changes, using expert trajectories to obtain the correct reward.
  • #18 After combining IRL and RL we obtain the structure shown in this flowchart. RL uses the reward created by IRL, while IRL needs the environment state and the RL action to maximize the log-likelihood reward for each state-action combination.
  • #19 We built an interactive maze environment, which makes it easy to collect different state-action trajectories. We used the different trajectories shown on the slide as target trajectories; from them, IRL calculates the reward that RL then uses to learn the path from the start to the target point in the maze.
  • #20 The top plot shows the reward progress when the IRL reward is given to RL and RL interacts with the environment. The high blue curve uses all the best trajectories shown on the previous slide, the curve near zero is the reward obtained using bad trajectories, and the middle gray curve uses a smaller number of trajectories during learning. The animation on the left keeps the same starting state as during learning and shows the agent's action at each state on the way to the goal; the animation on the right changes the starting condition of the maze and tests the agent's actions toward the target. The agent is able to reach the target every time.
  • #21 We are near the end; our goal was to detect intentional anomalies, and here is the result. The graph shows the total reward for different numbers of actions needed to reach the target. The 16-step curve is the normal reward curve, and up to 32 steps the reward curves keep a similar shape, but beyond 32 steps the total reward decreases dramatically, and the same behaviour can be observed from the actions the agent takes in the maze environment. In short, if the agent intentionally takes a long route, IRL reveals that intention through its reward.
  • #23 To summarize our work: we used Bayesian IRL for reward calculation, which is one possible way to calculate the reward; it is particularly useful for small sets of trajectories, where overfitting is a risk. We used IRL-reward-based learning both on datasets and in an interactive environment. So far we have used only a small interactive environment; for future work, we are not yet confident that the method works for large interactive environments, so we want to continue this work.
  • #23 By summarizing our work we used bayes IRL for reward calculation it is one possible way to calculate reward, it is very fruitful for small trajectory because there is possibility of overfitting. We used IRL reward based learning in the data based learning as well as interactive environment. So far we are using small interactive environment. For future work we are not confident does it works for the big interactive environment so further more we want to continue the work.