Information Theoretic aspect of reinforcement learning

HA JONG SU

TABLE OF CONTENTS
• PAC
• MULTI SKILL RL

PAC (Probably Approximately Correct)
PAC = theoretical basis of Occam's razor

PAC (Probably Approximately Correct)
R: error of the learned model over all the data
S: subset of the data (the training sample)
H: hypothesis space
h: learned parameters, i.e. the trained model
The bound decreases as the amount of data increases or as the hypothesis space narrows
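
For reference, the standard finite-hypothesis-class form of such a bound can be written with the symbols above. This is a sketch under assumptions: m = |S| (sample size), δ (failure probability) and the hat notation for the empirical error on S are not on the slide, and the loss is assumed to lie in [0, 1].

```latex
\Pr\left[\, \forall h \in H:\; R(h) \le \hat{R}_S(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}} \,\right] \ge 1 - \delta
```

The square-root term shrinks when m grows or when |H| shrinks, which matches the statement above.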

Hoeffding's inequality
The proof uses the “moment generating function”
http://cs229.stanford.edu/extra-notes/hoeffding.pdf
As the amount of data goes to infinity, the empirical error converges to the true error
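
A small simulation can make the inequality concrete. The sketch below compares the two-sided Hoeffding bound 2·exp(-2mε²) with how often the empirical mean of m coin flips actually deviates from the true probability by at least ε. The function names and the coin-flip setup are illustrative assumptions, not taken from the slides.

```python
import math
import random


def hoeffding_bound(m: int, eps: float) -> float:
    """Two-sided Hoeffding bound on P(|mean - p| >= eps) for m samples in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * m * eps ** 2))


def empirical_deviation_rate(p: float, m: int, eps: float,
                             trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the mean of m Bernoulli(p) draws is off by >= eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(m)) / m
        if abs(mean - p) >= eps:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for m in (10, 100, 1000):
        # The observed rate should stay below the bound, and both shrink as m grows.
        print(m, empirical_deviation_rate(0.3, m, 0.1), hoeffding_bound(m, 0.1))
```

The bound is loose for small m (it is capped at 1) and decays exponentially as m grows.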


Difference between RL and other algorithms
HMM, CRF, LSTM, GRU, memory networks, NTM, and DNC each have their own memory structure, called the “hidden state”, which plays the same role as the “state” in RL and thus gives the model a small hypothesis space.
CNN-like algorithms instead spend parameters on accommodating different data points with one another. This is why Hinton proposed Capsule Networks, which have a memory structure, arguing that “we need equivariance, not invariance”.

Simple implementations
GRU implementation from scratch with TensorFlow
https://github.com/kkugosu/PYTHON-Tensorflow---Jupyter---gru
CRF-GRU implementation with PyTorch
https://github.com/kkugosu/PYTORCH---dialogue_intention_extraction/blob/master/source_code/model/discrete_model/comp_model.py

Limitation of PAC
The bounds are vacuous
A single parameter in a neural network is encoded as a float32
• 60 parameters: |H| = 2^(32*60)
So we need to quantize the network to use these bounds.
• What if each parameter is itself a distribution?
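
The arithmetic behind "vacuous" can be checked directly. The sketch below plugs |H| = 2^(bits × parameters) into the standard finite-class gap sqrt((ln|H| + ln(1/δ)) / (2m)); the function name, δ = 0.05 and the sample sizes are assumptions chosen for illustration.

```python
import math


def finite_class_gap(num_params: int, bits_per_param: int, m: int,
                     delta: float = 0.05) -> float:
    """Generalization gap sqrt((ln|H| + ln(1/delta)) / (2m)) with |H| = 2^(bits * params)."""
    ln_h = num_params * bits_per_param * math.log(2.0)  # ln|H| without forming 2^1920
    return math.sqrt((ln_h + math.log(1.0 / delta)) / (2.0 * m))


if __name__ == "__main__":
    for bits in (32, 8):
        for m in (100, 10_000):
            # A gap above 1 says nothing about an error rate that lives in [0, 1].
            print(f"bits={bits} m={m} gap={finite_class_gap(60, bits, m):.3f}")
```

With 60 float32 parameters and 100 samples the gap exceeds 1, so the bound is vacuous; quantizing to fewer bits or collecting more data brings it below 1.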

Bayesian neural network
Gaussian process approximation
Explanation and python implementation
https://github.com/kkugosu/bayesian_neural_network

PAC-bayes
Used to estimate the performance of a model whose output is a distribution, like Bayesian neural networks...
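
One commonly quoted McAllester-style form of the bound replaces ln|H| with the KL divergence between a posterior Q and a prior P over parameters: gap = sqrt((KL(Q‖P) + ln(2√m/δ)) / (2m)). The sketch below evaluates it for diagonal Gaussians. The function names, the diagonal-Gaussian choice and δ = 0.05 are assumptions for illustration, not the method of the linked repository.

```python
import math
from typing import Sequence


def kl_diag_gaussians(mu_q: Sequence[float], sigma_q: Sequence[float],
                      mu_p: Sequence[float], sigma_p: Sequence[float]) -> float:
    """KL(Q || P) between diagonal Gaussians, summed over independent dimensions."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return kl


def pac_bayes_gap(kl: float, m: int, delta: float = 0.05) -> float:
    """McAllester-style gap between expected true error and expected empirical error."""
    return math.sqrt((kl + math.log(2.0 * math.sqrt(m) / delta)) / (2.0 * m))


if __name__ == "__main__":
    # Hypothetical 60-parameter posterior shifted by 0.1 from a unit-variance prior.
    kl = kl_diag_gaussians([0.1] * 60, [0.5] * 60, [0.0] * 60, [1.0] * 60)
    print(kl, pac_bayes_gap(kl, m=1000))
```

Unlike the float32 counting argument, the complexity term here depends on how far the posterior moves from the prior, not on the number of representable parameter values.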

Learning dynamics of RL
• A model needs less data as its bias increases (e.g. 1-step TD training, which makes strong assumptions), but this carries a high probability of distributional shift
• A model with high variance (e.g. Monte Carlo or naive PG) is also inaccurate, because it needs a tremendous amount of data.

Learning dynamics of RL
• A distributional-shift-like problem is a chronic issue that also occurs in generative models such as GANs, where it is called “mode collapse” and is essentially the same problem as distributional shift
• RL is an iterative process of making assumptions upon assumptions.
• Expectation <-> maximization
• Q assumption based on the policy <-> policy update based on Q

• Weak assumptions (need lots of data), sufficient exploration <-> strong assumptions (distributional shift)
• What if we are short of data and inevitably have to use a highly biased model?
• We have to deal with distributional shift!

We can find an optimal n-step return to trade off these problems
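
The n-step return interpolates between 1-step TD (n = 1, high bias) and Monte Carlo (n = episode length, high variance). A minimal sketch, assuming `rewards[t]` is the reward after step t, `values[t]` is a value estimate for the state at step t, and the state after the last reward is terminal; these names and conventions are illustrative.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], values: Sequence[float],
                  t: int, n: int, gamma: float) -> float:
    """G_t^(n) = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n}), truncated at episode end."""
    horizon = len(rewards)
    steps = min(n, horizon - t)  # never look past the end of the episode
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + steps < horizon:  # bootstrap only from a non-terminal state
        g += gamma ** steps * values[t + steps]
    return g


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    values = [10.0, 10.0, 10.0]  # deliberately biased value estimates
    for n in (1, 2, 3):
        print(n, n_step_return(rewards, values, t=0, n=n, gamma=0.5))
```

Small n leans on the (possibly biased) value estimate; large n leans on sampled rewards.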

Make the model robust with a regularization term, which usually restricts the rate of change of the distribution (shrinking the hypothesis space, as in PAC-Bayes)

TRPO (local optimality)
Ensure that the update improves J -> restrict the rate of change of the posterior distribution
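
The "restrict the rate of change" idea can be sketched as a backtracking search that shrinks a proposed update until the KL divergence between the old and new policy falls under a threshold. This is only the trust-region check for a single softmax policy over discrete actions, not full TRPO (no natural gradient, no conjugate gradient, no surrogate-improvement test); all names and default values are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_categorical(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def trust_region_step(logits: Sequence[float], direction: Sequence[float],
                      max_kl: float = 0.01, step: float = 1.0,
                      shrink: float = 0.5, max_tries: int = 20) -> Tuple[List[float], float]:
    """Shrink the step until KL(old policy || new policy) <= max_kl."""
    old = softmax(logits)
    for _ in range(max_tries):
        candidate = [l + step * d for l, d in zip(logits, direction)]
        if kl_categorical(old, softmax(candidate)) <= max_kl:
            return candidate, step
        step *= shrink
    return list(logits), 0.0  # no acceptable step found: keep the old policy


if __name__ == "__main__":
    new_logits, used_step = trust_region_step([0.0, 0.0], [1.0, -1.0])
    print(new_logits, used_step)
```

The accepted step is the largest one in the shrinking schedule that keeps the new policy inside the KL ball around the old one.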

Other basic RL algorithms
• PyTorch implementations of PG, DQN, AC, DDPG, TRPO, PPO, SAC
• https://github.com/kkugosu/RL_BASIC

How to secure global optimality
We restricted the distribution, which lowers the hypothesis space,
but we still can't claim optimality, because we still can't apply PAC theory.
The problem is that the data depends on the function (the policy being learned).
We can solve this problem by representing uncertainty as a distribution
• (Bayesian reinforcement learning, distributional reinforcement learning)

PAC-MDP (global optimality)
• E3 -> Near-Optimal Reinforcement Learning in Polynomial Time
• R-max -> A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning
• On the Sample Complexity of Reinforcement Learning (a modification of R-max)
• MBIE -> A Theoretical Analysis of Model-Based Interval Estimation
• MBIE-EB -> An Analysis of Model-Based Interval Estimation for Markov Decision Processes
• Delayed Q-learning -> PAC Model-Free Reinforcement Learning
• Reinforcement Learning in Finite MDPs: PAC Analysis
• PAC-inspired Option Discovery in Lifelong Reinforcement Learning

PAC-MDP
• PAC Continuous State Online Multitask Reinforcement Learning
with Identification (continuous space)

MBIE-EB (exploration bonus) (2008)
original
IE version

MBIE-EB
Confidence interval (CI) for the reward
CI for the trajectory

MBIE-EB
Performance bound given by the CIs of the reward and the trajectory
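
The exploration-bonus idea can be sketched as value iteration on the empirical model with an extra β/√n(s,a) term added to the reward, so rarely tried state-action pairs look more attractive. This is a simplified sketch: the array layout, β, γ, the iteration count and the handling of unvisited pairs (count clamped to 1) are all assumptions, not the paper's exact procedure.

```python
import numpy as np


def mbie_eb_q_values(counts: np.ndarray, reward_sums: np.ndarray,
                     trans_counts: np.ndarray, gamma: float = 0.95,
                     beta: float = 1.0, iters: int = 200) -> np.ndarray:
    """Value iteration on the empirical MDP with an exploration bonus beta / sqrt(n(s, a)).

    counts:       (S, A)    visit counts n(s, a)
    reward_sums:  (S, A)    summed observed rewards
    trans_counts: (S, A, S) observed transition counts
    """
    n = np.maximum(counts, 1)            # assumption: treat unvisited pairs as one visit
    r_hat = reward_sums / n              # empirical mean reward
    t_hat = trans_counts / n[:, :, None]  # empirical transition probabilities
    bonus = beta / np.sqrt(n)
    q = np.zeros_like(r_hat, dtype=float)
    for _ in range(iters):
        v = q.max(axis=1)                # greedy state values
        q = r_hat + bonus + gamma * (t_hat @ v)
    return q


if __name__ == "__main__":
    counts = np.array([[100, 1], [100, 1]])
    reward_sums = np.zeros((2, 2))
    trans_counts = np.zeros((2, 2, 2))
    for s in range(2):
        trans_counts[s, :, s] = counts[s]  # every action loops back to the same state
    print(mbie_eb_q_values(counts, reward_sums, trans_counts))
```

In this toy case all true rewards are zero, so the only difference between actions is the bonus, and the rarely tried action ends up with the higher Q-value.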

Sham Machandranath Kakade
• A hidden monster of the RL theory field

Thompson sampling
Variance decreases as the number of samples increases -> uncertainty decreases
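
A Beta-Bernoulli bandit is a small setting in which this effect is visible: each arm keeps a Beta posterior, one sample is drawn per arm, and the arm with the largest sample is pulled. The posterior variance of an arm shrinks as it collects more pulls. The arm probabilities, uniform Beta(1, 1) prior and function names below are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def beta_variance(a: float, b: float) -> float:
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def thompson_sampling(true_probs: Sequence[float], steps: int,
                      seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Run Beta-Bernoulli Thompson sampling; returns (pulls, alpha, beta) per arm."""
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # Beta(1, 1) prior: successes + 1
    beta = [1] * k   # failures + 1
    pulls = [0] * k
    for _ in range(steps):
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        if rng.random() < true_probs[arm]:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        pulls[arm] += 1
    return pulls, alpha, beta


if __name__ == "__main__":
    pulls, alpha, beta = thompson_sampling([0.2, 0.8], steps=2000)
    print(pulls, [beta_variance(a, b) for a, b in zip(alpha, beta)])
```

As the posterior of the better arm sharpens, the sampled values concentrate and exploration of the worse arm fades out on its own.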

LQR
Not a learning or approximation process
F and C are given, and we calculate K, which is the policy
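
Computing K from known dynamics and cost is a backward Riccati recursion. The sketch below uses the common (A, B, Q, R) parameterization and assumes it corresponds to the slide's notation as F = [A B] and C = blockdiag(Q, R), with no linear or constant terms; that mapping and the function name are assumptions.

```python
import numpy as np


def lqr_gains(A: np.ndarray, B: np.ndarray, Q: np.ndarray, R: np.ndarray,
              horizon: int) -> list:
    """Finite-horizon discrete LQR for x' = A x + B u with cost x^T Q x + u^T R u.

    Returns feedback gains K_0 .. K_{horizon-1}, where the policy is u_t = K_t x_t.
    """
    P = Q.copy()  # terminal cost-to-go matrix
    gains = []
    for _ in range(horizon):
        # K = -(R + B^T P B)^{-1} B^T P A
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: cost-to-go one step earlier under the optimal action
        P = Q + A.T @ P @ A + A.T @ P @ B @ K
        gains.append(K)
    gains.reverse()  # the recursion runs backward in time
    return gains


if __name__ == "__main__":
    one = np.array([[1.0]])
    K0 = lqr_gains(one, one, one, one, horizon=50)[0]
    print(K0)  # approaches -(sqrt(5) - 1) / 2 for this scalar system
```

Nothing here is learned: given the model and the cost, the gains follow from linear algebra alone.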

iLQR
A learning process
F and C are not exact but learned

PILCO
(2011)
PILCO (GP + iLQR)

Deep PILCO
Bayesian NN with sampling
We can't use iLQR because a BNN can't be expressed as a closed-form, directly computable function

GPS
• Model the environment with a Bayesian NN
• Train the policy with iLQR at the same time
• Dual gradient descent makes use of strong duality in convex optimization.
• https://github.com/kkugosu/RL_MODELBASED
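
Dual gradient descent alternates between minimizing the Lagrangian over the primal variable and taking a gradient step on the multiplier. The sketch below runs it on a toy convex problem, minimize x² subject to x ≥ 1, which is a hypothetical stand-in and not the GPS objective; the step size and iteration count are assumptions.

```python
from typing import Tuple


def dual_gradient_descent(steps: int = 100, lr: float = 0.5) -> Tuple[float, float]:
    """Solve min x^2 s.t. x >= 1 via the Lagrangian L(x, lam) = x^2 + lam * (1 - x)."""
    lam = 0.0
    x = 0.0
    for _ in range(steps):
        # Primal step: argmin_x L(x, lam) has the closed form x = lam / 2.
        x = lam / 2.0
        # Dual step: gradient of the dual w.r.t. lam is the constraint violation (1 - x).
        lam = max(0.0, lam + lr * (1.0 - x))
    return x, lam


if __name__ == "__main__":
    x, lam = dual_gradient_descent()
    print(x, lam)  # converges toward x = 1, lam = 2
```

Because the toy problem is convex and strong duality holds, the dual iterate recovers the constrained optimum.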

PAC-Bayes for MDP (2022.11)
• PAC-Bayes Bounds for Bandit Problems: A Survey and Experimental Comparison
• https://arxiv.org/pdf/2211.16110.pdf

Auto ML
• HPO (Hyperparameter Optimization)
• NAS (Neural Architecture Search)
• Finding the most plausible H
• Meta learning
• Increasing the amount of data

Meta Reinforcement Learning
• Transfer learning
• Hierarchical multi-skill RL
• Contrastive multi-skill RL
• …
• They improve their performance by solving multiple tasks at the same time or cumulatively

Contrastive-Based Multi-Skill RL
• Consider how much a given skill contributes compared to the entire skill set
• Maximize IG (information gain)
• Makes skills push each other apart

Contrastive-Based Multi-Skill RL
“Theoretically distributional shift doesn’t occur in this method,
Flawless theory”

DIAYN
The skills aren't given a specific task; instead, they are trained to differentiate themselves from each other. The main objective is for each skill to be distinct and unique compared to the other skills.

DIAYN
Skills are rewarded for maximizing information gain
A regularization term strengthens the feature embedding at the same time
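
The DIAYN pseudo-reward is usually written as r = log q(z | s) - log p(z), where q is a learned skill discriminator and p(z) is the skill prior. The sketch below computes it from raw discriminator logits, assuming a uniform prior over a discrete set of skills; the function names and example logits are illustrative.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def diayn_reward(disc_logits: Sequence[float], skill: int) -> float:
    """r = log q(z | s) - log p(z), with p(z) uniform over len(disc_logits) skills."""
    log_q = log_softmax(disc_logits)[skill]
    log_p = -math.log(len(disc_logits))
    return log_q - log_p


if __name__ == "__main__":
    print(diayn_reward([0.0, 0.0, 0.0, 0.0], skill=2))   # discriminator is clueless
    print(diayn_reward([0.0, 0.0, 8.0, 0.0], skill=2))   # state clearly identifies skill 2
```

The reward is zero when the state says nothing about the skill, and approaches log(number of skills) when the state identifies the skill unambiguously.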

DIAYN
• Learning the “log_probability” with a neural network inherently produces “protuberances”, i.e. inaccuracies in the approximation

DIAYN
• Strengthening features toward the skills that cover a particular state, on top of such a “protuberance”, makes the model fall easily into a local minimum, which might be called the “distributional shift problem” in reinforcement learning
• So they added a regularization term, but it seems to be insufficient. This model fell short on state-space coverage

EDL, SMM
• SMM: Efficient Exploration via State Marginal Matching
• They try to place skills in the uncovered areas

EDL, SMM
• To address the insufficient-coverage problem, they simply fix p(s)
• EDL: Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills

APT
It gives up on learning the distinctiveness of each skill.
It still strengthens the feature embedding, but by learning contrastive representations rather than through distinctiveness;
it simply maximizes the entropy of the states occupied by the skills

APT
Give a reward that maximizes the entropy of the state in the space of k-nearest neighbors
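
A particle-based version of this reward can be sketched as r_i = log(c + mean distance from embedding i to its k nearest neighbors in the batch), so states in sparsely visited regions earn more. The batch of 2-D points, k, and the constant c below are assumptions for illustration; in practice the distances would be computed on learned contrastive embeddings.

```python
import numpy as np


def knn_entropy_reward(embeddings: np.ndarray, k: int = 3, c: float = 1.0) -> np.ndarray:
    """Per-state reward log(c + mean distance to the k nearest neighbors in the batch)."""
    z = np.asarray(embeddings, dtype=float)
    # Pairwise Euclidean distances, shape (N, N); requires N > k.
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)         # a point is not its own neighbor
    knn = np.sort(dists, axis=1)[:, :k]     # k smallest distances per row
    return np.log(c + knn.mean(axis=1))


if __name__ == "__main__":
    batch = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0]])
    print(knn_entropy_reward(batch, k=2))   # the isolated point gets the largest reward
```

Maximizing this reward pushes the policy toward states that are far from what it has already visited, which raises the entropy of the visited-state distribution.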

THE INFORMATION GEOMETRY OF UNSUPERVISED REINFORCEMENT LEARNING
• Fixing p(s)
• -> lowers the understanding of the environment
• -> the learned skills end up far from optimal, so we have to train and find more and more skills
• But if we don't fix p(s), the learned skills are optimal for certain reward functions.
• They implemented contrastive-based multi-skill RL on a “3-state MDP” without the regularization term, and found that the skills are optimal for some reward functions that were not given during learning (not for every reward function), and that the number of skills stops increasing at some point


Implementation
• I implemented every contrastive multi-skill algorithm with PyTorch on my GitHub
• VIC, DIAYN, DADS, EDL, VISR, VALOR, APT, APS, CIC…
• https://github.com/kkugosu/RL_META

References
• https://www.youtube.com/playlist?list=PL_iWQOsE6TfURIIhCrlt-wj9ByIVpbfGc
• https://www.youtube.com/watch?v=t5GBuBD0ibc
• https://www.youtube.com/watch?v=ar9RLwgUvVQ
• https://www.edwith.org/bayesiandeeplearning
• https://www.sciencedirect.com/science/article/pii/S0022000008000767
• https://web.stanford.edu/class/cs234/CS234Win2020/slides/lecture13.pdf
• https://arxiv.org/abs/1802.06070
• …
