Deep Reinforcement Learning for Control
of Probabilistic Boolean Networks
Georgios Papagiannis1 and Sotiris Moschoyiannis2
1University of Cambridge, UK
2University of Surrey, UK
Complex Networks and their Applications 2020 – 1 Dec 2020
s.moschoyiannis@surrey.ac.uk
Boolean Networks1 (BNs)
[Figure: example Boolean network with nodes n1–n5]
A class of discrete dynamical systems:
• Nodes represent genes,
gene expression is quantized: 0 (inactive), 1 (active)
• Expression level of each gene is functionally related to
the expression states of some other genes
• At each time step,
each node computes and produces output (0 or 1),
which is input for its connected nodes in the next time step
[Figure: the Boolean network (BN), with AND/OR update functions, and its corresponding state space]
1 Kauffman (1969) Metabolic stability and epigenesis in randomly constructed genetic nets, J. of Theoretical Biology, 22(3):437-467
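To make the synchronous update just described concrete, here is a minimal Python sketch (purely illustrative, not the network in the figure): three hypothetical nodes, each recomputing its value from the current state at every time step.

```python
# Minimal sketch of a synchronous Boolean network (BN) update.
# Illustrative 3-node example, not the 5-node network from the slide.

def bn_step(state):
    """state: tuple of 0/1 values (x1, x2, x3); returns the next state."""
    x1, x2, x3 = state
    return (
        x2 & x3,      # x1 <- x2 AND x3
        x1 | x3,      # x2 <- x1 OR  x3
        1 - x1,       # x3 <- NOT x1
    )

state = (1, 0, 1)
for t in range(5):
    print(t, state)
    state = bn_step(state)   # all nodes update simultaneously
```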
Attractors
[Figure: the Boolean network (BN) and its corresponding state space, with fixed point and limit cycle attractors highlighted]
Dynamics of BNs dictate that the network will
evolve to a state, or set of states, that it cannot
leave without external intervention
• Fixed point attractors
• Limit cycle attractors
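Since the update is deterministic, an attractor can be located by iterating until a state repeats; a small sketch, reusing the hypothetical bn_step from the previous example:

```python
def find_attractor(step, initial_state):
    """Iterate a deterministic BN until a state repeats; return the attractor
    (a single state for a fixed point, several states for a limit cycle)."""
    seen = {}                     # state -> time step at which it was first visited
    state, t, trajectory = initial_state, 0, []
    while state not in seen:
        seen[state] = t
        trajectory.append(state)
        state = step(state)
        t += 1
    return trajectory[seen[state]:]   # the repeating part of the trajectory

print(find_attractor(bn_step, (1, 0, 1)))   # limit cycle reached from this initial state
```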
[Figure: probabilistic Boolean network with nodes n1–n5; node n1 has three candidate functions: AND with p=0.5, OR with p=0.3, NAND with p=0.2]
Probabilistic Boolean Networks2 (PBNs)
More than one Boolean function at each node;
one function executes at each step t, with prob. p
Accommodate uncertainty in gene regulation.
• Dynamics of PBNs
- admit Markov Chain theory (and, under intervention, MDPs)
- exhibit attractors; these manifest as:
• absorbing states
• irreducible sets
2 Shmulevich et al (2002) Probabilistic Boolean Networks: a rule-based uncertainty model for gene regulatory networks, Bioinformatics 18(2):261-274
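One PBN step can be sketched by sampling, for each node, one of its candidate functions according to its probability. The sketch below is illustrative only: it borrows the AND/OR/NAND probabilities shown for n1 and invents deterministic functions for the other two nodes.

```python
import random

def pbn_step(state, rng=random):
    """One step of a toy 3-node PBN (illustrative, not the paper's network)."""
    x0, x1, x2 = state
    # Node 0: choose AND / OR / NAND of its inputs with prob. 0.5 / 0.3 / 0.2
    candidates = [
        (0.5, lambda: x1 & x2),          # AND,  p = 0.5
        (0.3, lambda: x1 | x2),          # OR,   p = 0.3
        (0.2, lambda: 1 - (x1 & x2)),    # NAND, p = 0.2
    ]
    probs = [p for p, _ in candidates]
    funcs = [f for _, f in candidates]
    f0 = rng.choices(funcs, weights=probs, k=1)[0]
    # Nodes 1 and 2: single (made-up) functions, kept deterministic for brevity
    return (f0(), x0 ^ x2, 1 - x1)

state = (1, 0, 1)
for t in range(5):
    print(t, state)
    state = pbn_step(state)
```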
Gene Regulatory Networks (GRNs)
[Figures: segment polarity genes 5; fission yeast cell-cycle 6]
• Spontaneous emergence of ordered collective behaviour 3
e.g., functional states of the cell such as growth or quiescence
correspond to such attractors 3
e.g., high / low resistance to antibiotics at different attractors 4
• (Why PBN study is useful) Targeted therapeutics: external
perturbation on certain gene(s), at certain state(s), can drive the
GRN to a desirable attractor (drug targets)
• where perturbation = change of state (i.e., 0 -> 1, 1 -> 0)
• Kauffman: attractors are stable under most gene perturbations
We study PBNs as dynamical systems in which a change of state of certain genes, at certain states, may drastically affect the state of the network as a whole, and
• lead to a different attractor, with desirable properties, or
• switch between attractors
3 Huang, Ingber (2000) Shape-dependent control of cell growth: Switching between attractors in cell regulatory networks. Experimental Cell Research, 261(1): 91-103
4 Reardon (2017) Modified viruses deliver death to antibiotic-resistant bacteria. Nature, 546:586-587
5 Albert, Othmer (2003) The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J. of Theoretical Biology, 223(1):1-18
6 Wang, Du, Chen, et al (2010) Process-based network decomposition reveals backbone motif structure. Proc Natl Acad Sci 107(23):10,478–10,483
Control
Complex Networks perspective:
A dynamical system is controllable if it can be driven from
any initial state to any desired state, within finite time 7
7 Liu, Slotine, Barabasi (2011) Controllability of complex networks, Nature 473: 167-173
Our goal:
Discover control strategies to effect perturbations on individual nodes
(targeted intervention), aiming to drive the whole network from its current
state to a specified target state that exhibits desirable (biological) properties.
Control in (P)BNs
Comes in different flavours.
e.g.,
Assume control inputs,
intervene only using these 8
Intervene on one node to
affect another node’s state 9
Intervene on any node to affect
the long-run network behaviour3
8 Wu, Guo, Toyoda (2020) Policy Iteration Approach to the Infinite Horizon Average Optimal Control of Probabilistic Boolean Networks, IEEE Trans. Neural Netw. Learn. Syst.
9 Pal, Datta, Dougherty (2006) Optimal Infinite-Horizon Control for Probabilistic Boolean Networks. IEEE Trans on Signal Processing, 54(6):2375-2387
10 Shmulevich, Dougherty, Zhang (2002) Gene Perturbation and Intervention in PBNs. Bioinformatics 18(10):1319-1331
Lac operon of E. coli 8
Metastatic melanoma 9
Toy example from Shmulevich 10
Control (in our work, here)
What is the series of required interventions (which gene, at which step) to drive a
PBN from any state towards a target attractor, within a finite number of steps?
- Can intervene on any node
- Can intervene on at most one node at each time step
- Each intervention is followed by a natural evolution step (internal dynamics)
- Aim for minimum number of interventions (perturbations)
- Limit the number of steps (or number of interventions)
- Assume no additional info from systems biology study
- One requirement: knowledge of the target attractor
STG / Probability Transition Matrix
Intractable
Lac operon E.coli PBN
n = 9 nodes
Corresponding State Transition Graph (STG) : 2^9 = 512 states
Corresponding Probability Transition Matrix (PTM) : 2^9 x 2^9
STG / Probability Transition Matrix
Corresponding STG : 2^10 = 1,024 states
Corresponding Probability Transition Matrix : 2^10 x 2^10
Fission yeast PBN
n = 10 nodes
STG / Probability Transition Matrix
Corresponding STG : 2^20 = 1,048,576 states
Corresponding Probability Transition Matrix : 2^20 x 2^20
Synthetic PBN
n = 20 nodes
(in this paper – see Section 4)
STG / Probability Transition Matrix
Corresponding STG : 2^28 = 268,435,456 states
Corresponding Probability Transition Matrix : 2^28 x 2^28
Metastatic Melanoma PBN
n = 28 nodes
The Probability Transition Matrix (PTM) becomes
computationally intractable for larger networks 11 –
it requires the estimation of 2^n x (2^n – 1) probabilities.
Can we work without the PTM?
11 Akutsu, Hayashida, et al (2007) Control of Boolean networks: Hardness results and algorithms for tree structured network. Journal of Theoretical Biology, 244(4):670-679
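The blow-up is easy to quantify: the PTM of an n-node PBN has 2^n x 2^n entries, of which 2^n x (2^n - 1) are free parameters (each row sums to 1). A quick back-of-the-envelope check:

```python
# Number of probabilities behind the PTM of an n-node (P)BN.
for n in (9, 10, 20, 28, 70):
    states = 2 ** n
    free_params = states * (states - 1)   # entries to estimate per the slide
    print(f"n = {n:2d}: {states:,} states, about {free_params:.2e} probabilities")
```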
How
Formulate the problem as one of reward maximization and
use Reinforcement Learning.
Reinforcement Learning
Policy 𝜋(𝑠): how to select an action at each state; a distribution over actions given a state.
Goal: Maximize expected cumulative reward.
12 Sutton, Barto (2018) Reinforcement Learning: an introduction, MIT Press [Chapter 6]
Markov Decision Process (MDP)
An MDP is a tuple (𝑆, 𝐴, 𝑃, 𝑅, 𝛾)
𝑆 – Set of states of the environment
𝐴 – Set of possible actions to perform
at some state s ∈ 𝑆
𝑃 – State transition matrix, where P^a_{s_t, s'_{t+1}} = P[ s'_{t+1} | s_t, a_t ], for a ∈ 𝐴
𝑅 – Reward function, where R^a_{s_t} = E[ R_{t+1} | s_t, a_t ]
𝛾 – Discount factor 𝛾 ∈ [0, 1]
PBNs as MDPs:
𝑆 – Binary states
𝐴 – Possible interventions given a state s ∈ 𝑆
(in fact, N+1 actions at each state)
𝑃 – Probability of transitioning between
binary states, given Boolean function realisations
𝑅 – Problem dependent
(we define the reward function)
𝛾 – Problem dependent
(we choose this)
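Putting the two columns together, the control problem can be wrapped in a reset/step environment. The sketch below is an assumption about the interface, not the paper's code: the state is the binary vector, actions 0..N-1 flip the corresponding node, action N means 'no intervention', and every action is followed by one natural (stochastic) PBN step; the reward values here are placeholders (the actual reward is discussed later).

```python
import random

class PBNControlEnv:
    """Minimal sketch of a PBN control environment (illustrative interface)."""

    def __init__(self, pbn_step, n_nodes, target_attractor, horizon=11):
        self.pbn_step = pbn_step              # function: state tuple -> next state tuple
        self.n = n_nodes
        self.target = set(target_attractor)   # set of states in the target attractor
        self.horizon = horizon                # max interventions per episode

    def reset(self):
        self.t = 0
        self.state = tuple(random.randint(0, 1) for _ in range(self.n))
        return self.state

    def step(self, action):
        s = list(self.state)
        if action < self.n:                   # actions 0..N-1 flip a node; action N is a no-op
            s[action] = 1 - s[action]
        self.state = self.pbn_step(tuple(s))  # natural evolution step follows the intervention
        self.t += 1
        done = self.state in self.target or self.t >= self.horizon
        reward = 1.0 if self.state in self.target else -1.0   # placeholder values
        return self.state, reward, done
```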
Q-Learning
Temporal-Difference (TD) learning.
Model-free – no knowledge of the dynamics (no PTM) is required.
Off-policy.
N.B. ‘Q’ in Q-Learning stands for Quality, or value, of an action
Q-Learning
Q(s_t, a_t) : the expected reward of taking action a_t at state s_t, at time step t.
Sample update (new estimate = old estimate + step size × error in estimate):
Q(s_t, a_t) ← Q(s_t, a_t) + α [ R_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
where R_{t+1} + γ max_a Q(s_{t+1}, a) is the target for the update and the bracketed term is the error in the estimate.
12 Sutton, Barto (2018) Reinforcement Learning: an introduction, MIT Press [Chapter 6]
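In tabular form this update is a couple of lines; a sketch with a dictionary Q-table (the environment interface above would supply s, a, r and the next state):

```python
from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] -> current estimate
alpha, gamma = 0.1, 0.95      # learning rate and discount factor (illustrative values)

def q_update(s, a, r, s_next, n_actions):
    """One Q-Learning sample update: new estimate = old estimate + alpha * TD error."""
    td_target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```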
But how do we get to the true value Q*?
ε-greedy
Simplest idea for ensuring continual exploration.
All m actions available are tried with non-zero probability:
with probability ε choose an action at random (define small ε)
with probability 1 – ε choose the greedy action
where a greedy action is an action whose expected reward is the greatest,
and is given by argmax_a Q(s, a)
Approximate Q* iteratively, by selecting actions at each time step.
N.B. Q has been shown to converge to Q* with ε-greedy e.g., see Sutton, Barto (2018) Reinforcement Learning: an introduction, MIT Press
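The ε-greedy rule itself is a few lines; a sketch (q_values would come from the Q-table or, later, the DQN):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit: argmax_a Q(s, a)
```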
Q-Learning implies storing a Q value for each state–action pair – costly for large state spaces.
Deep Q Net (DQN)
Use a function approximator to learn a parameterised form Q(s, a; θ).
Use DQN to iteratively update θ, in order to approximate Q*(s, a; θ) (true Q values).
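A minimal DQN sketch, assuming PyTorch (the framework and layer sizes are assumptions, not details from the paper): the network maps the N-bit state to N+1 Q values, one per intervention plus the no-op.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Small MLP approximating Q(s, a; theta) for an N-node PBN (sketch)."""

    def __init__(self, n_nodes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_nodes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_nodes + 1),   # one Q value per action (N flips + no intervention)
        )

    def forward(self, state):                 # state: float tensor of shape (batch, N)
        return self.net(state)
```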
TD-Learning is often susceptible to large oscillations in expected Q values.
Use a separate network to determine the TD-target.
Double DQN (DDQN)
• “target” DQN – initialised with the same parameters as the main DQN (“policy” DQN) but
has its parameters updated every k iterations
• “policy” DQN – the expected Q values of the target DQN are fixed and every k iterations the
parameters of the policy DQN are copied to the target
The policy DQN’s parameters are used to update the target: θ′_t = θ_t every k time steps.
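One common way to compute the TD target with the two networks (a sketch, assuming the PyTorch DQN above; γ and the batch tensors are illustrative): the policy DQN picks the next action, the frozen target DQN evaluates it, and every k training steps θ is copied to θ′.

```python
import torch

def td_targets(policy_dqn, target_dqn, rewards, next_states, dones, gamma=0.95):
    """Double-DQN style TD targets (sketch): policy net selects, target net evaluates."""
    with torch.no_grad():
        next_actions = policy_dqn(next_states).argmax(dim=1, keepdim=True)
        next_q = target_dqn(next_states).gather(1, next_actions).squeeze(1)
        return rewards + gamma * next_q * (1.0 - dones)   # dones: float tensor of 0/1

def sync_target(policy_dqn, target_dqn):
    """Copy theta -> theta' (called every k training steps)."""
    target_dqn.load_state_dict(policy_dqn.state_dict())
```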
Reward
Objective: Find a policy that drives a PBN to an attractor in order to maximise reward
The reward is defined with respect to the set of states in the target attractor: positive reward for actions that reach it, negative otherwise, with a larger penalty for actions that lead to a non-target attractor.
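A sketch of such a reward function (the shape follows the description above; the magnitudes are made up, not the paper's values):

```python
def reward(next_state, target_attractor, other_attractors):
    """Illustrative reward: positive for reaching the target attractor, negative
    otherwise, with a larger penalty for falling into a non-target attractor."""
    if next_state in target_attractor:
        return +5.0                          # reached the target attractor
    if any(next_state in a for a in other_attractors):
        return -5.0                          # trapped in the wrong attractor
    return -1.0                              # still wandering: small penalty per step
```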
Training a network from consecutive samples directly from the environment is
susceptible to strong correlations in the data.
Sample from a batch of experiences, at each time step t, to update the DDQN.
Prioritised Experience Replay (DDQN with PER)
During training,
• the agent observes state s_t, performs action a_t on the environment, and
• then, the environment transitions to s_{t+1} and the agent receives reward R_{t+1}
The transition / experience (s_t, a_t, R_{t+1}, s_{t+1}) is stored in a replay buffer
• 5K buffer (for n=10 nodes); 500K (for n=20)
At each t, a batch of experiences is sampled in order to update the network parameters
• 128 for n=10; 512 for n=20
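Storage and batch sampling can be sketched with a bounded buffer (the capacities and batch sizes in the comments are the ones quoted above; uniform sampling shown here, the prioritised variant follows):

```python
import random
from collections import deque

class ReplayBuffer:
    """Plain experience replay sketch (5K capacity for n=10, 500K for n=20;
    batch sizes 128 and 512 respectively, per the slide)."""

    def __init__(self, capacity=5_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=128):
        return random.sample(list(self.buffer), batch_size)
```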
N.B. Please see Section 3.2 (pp. 4-5) in the paper for more detail.
PER – proportional, importance
• Proportional - the probability of experience i being sampled is given by P(i) = p_i^ω / Σ_k p_k^ω,
where p_i = |δ_i| + c is the priority of sample i, δ is the TD-error, c is a small constant to prevent
experiences with zero TD-error from never being replayed, and ω is the magnitude of prioritisation
• Importance weights are used to compensate for samples with high TD-error being sampled more often:
w_i = ( 1 / (L · P(i)) )^β, where L is the size of the replay memory and β is used to anneal the amount of
importance sampling over training episodes.
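The two formulas translate directly into code; a sketch with NumPy (c, ω and β values are illustrative, and normalising the weights by their maximum is a common implementation choice, not something stated on the slide):

```python
import numpy as np

def per_probabilities(td_errors, c=1e-2, omega=0.6):
    """Proportional prioritisation: p_i = |delta_i| + c, P(i) = p_i^omega / sum_k p_k^omega."""
    priorities = np.abs(np.asarray(td_errors)) + c
    scaled = priorities ** omega
    return scaled / scaled.sum()

def importance_weights(probs, beta=0.4):
    """Importance-sampling weights: w_i = (1 / (L * P(i)))^beta, scaled by the max weight."""
    L = len(probs)
    weights = (1.0 / (L * probs)) ** beta
    return weights / weights.max()
```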
Does it work?
Does it work for real GRNs?
Results
Success rate is at least 99% from any initial state
PBN10
- 1,024 states
- attractor occurs 1/100 times
- random interventions: 1,387
- horizon of 11 is < 1% of that
- DRL: 99.8% successful control
- 100% if horizon is set to 14

PBN20
- 1,048,576 states
- attractor occurs 1/10,000 times
- random interventions: 6,511
- horizon of 100 is < 1.5% of that
- DRL: 100% successful control
- 99% if horizon is set to 15

Melanoma
- 512 states
- attractor set to 1001111, motivated by biology (2nd gene, WNT5A, unexpressed)
- DRL: 100% successful control for a horizon of 10
- 99.72% if horizon is set to 7
Not in this paper – control larger PBNs
We have tried our DRL (DDQN with PER) method
on the more common type of control problem 13,14
Intervene on pirin’s state only, aiming to drive the
PBN to a state where WNT5A is OFF (target state).
Cancerous Melanoma PBN inferred from GRN data 13,14
13 Pal, Datta, Dougherty (2006) Optimal Infinite-Horizon Control for Probabilistic Boolean Networks. IEEE Trans on Signal Processing, 54(6):2375-2387
14 Sirin, Polat, Alhajj (2013) Employing Batch Reinforcement Learning to Control Gene Regulation Without Explicitly Constructing Gene Regulatory Networks, 23rd IJCAI 2013, 2042-2048
Not in this paper – control larger PBN (N = 70)
On N=7 we get favourable performance
to existing literature 13,14
On N=28 we get favourable performance
to existing literature 14
On N=70 we get 97.6% successful
control.
This is the largest PBN to be controlled,
from real data or synthetic data.
Joint work with Vytenis Sliogeris, paper under preparation
13 Pal, Datta, Dougherty (2006) Optimal Infinite-Horizon Control for Probabilistic Boolean Networks. IEEE Trans on Signal Processing, 54(6):2375-2387
14 Sirin, Polat, Alhajj (2013) Employing Batch Reinforcement Learning to Control Gene Regulation Without Explicitly Constructing Gene Regulatory Networks, 23rd IJCAI, 2042-2048
Not in this paper – infer larger PBNs
We have been successful in inferring a PBN directly from real gene expression data
(samples taken when network in a steady-state distribution)
• Metastatic melanoma dataset from Bittner et al1
• Using CoDs and a perceptron as a predictive model2,3
Our approach does not build the PTM (as our control method does not need it!).
We are looking at inferring a PBN from real, time-series gene expression data
• But, typically, studies provide no more than 6-7 time steps
This is in progress – please get in touch if you are also working on something like this.
1 Bittner, Meltzer, Chen et al (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406: 536-540
2 Kim, Dougherty, et al (2000) General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. Journal of Biomedical Optics 5: 411-424
3 Shmulevich, Dougherty, Zhang (2002) Gene Perturbation and Intervention in PBNs. Bioinformatics 18(10): 1319-1331
Thank you for listening.
Any questions?
s.moschoyiannis@surrey.ac.uk


Editor's Notes

  1. Boolean networks were introduced by Stuart Kauffman as a model of genetic networks. BNs are a class of discrete dynamical systems, characterized by interactions over a set of Boolean variables. Nodes represent genes… As the network evolves we build the corresponding state space, shown here, which shows the state of the network as a whole between time steps.
  2. If you let these networks evolve for some time, they will invariably end up in a state they cannot get out of, at least not without external intervention. These are the so-called attractors and they come in the form of fixed points (blue) or limit cycles (red)
  3. Probabilistic BNs extend BNs to accommodate uncertainty in gene regulation. In PBNs each node is associated with more than one function, and one of these executes at each time step, with some probability p. So node n1 will use AND 5 out of 10 times, OR 3/10 times, NAND 2/10 times. The dynamics are similar to BNs with absorbing states and irreducible sets, this time, being the attractors.
  4. BNs have been used to model Gene Regulatory Networks Seminal work by Reka Albert, who gave the keynote in last year's Complex Networks conf (Lisbon), another example - the fission yeast cycle here There is a nice correspondence between ordered collective behaviour in GRNs and attractors in PBNs. I note here just 2 of the works from systems biology and genetics, who point to the fact that certain attractors are desirable, others not (ref in 3 here); sometimes switching between attractors is a good idea (ref in 4 here) So beyond using PBN for modelling GRNs, one starts to think where and when to intervene (what gene to change from 0 to 1, and vice versa) in order to drive the GRN to a desired state. Kauffman since 1993 tells us that not all interventions are effective.. so we propose the study of PBNs as a dynamical system ... PAUSE.. where change of state on certain genes (your control parameter - micro) has drastic effect on the state of the whole network (your order parameter - macro)
  5. So we approach Control from a complex networks perspective. I quote from Liu Slotine Barabasi’s 2011 paper in Nature: “READ IT OUT LOUD “
  6. And I say this because Control in the literature comes in different flavours. In certain GRNs one can assume control inputs and focus on those only (the green ones in the example at the top) to control the network. Other works consider intervening on one node only, to affect another node’s state (study of Melanoma in ref 9 here)
  7. Just to be clear, in the work in this paper specifically, we ask: “READ IT” And these are the rules of engagement, so to speak. Won’t go through all of them, they are discussed in detail in the paper! -- Interventions allowed on any node. Only one node at a time, so can’t change all of the nodes at once. Respect the internal dynamics as much as possible, so minimum number of interventions, finite number of steps
  8. I mentioned the state space when introducing PBNs. For N nodes, and since nodes are Boolean variables, there are 2^N states. So 9 nodes in the PBN, 512 states Matrix representation of that is 2^N times 2^N
  9. Fission yeast modelled as a PBN of 10 nodes, state space of 1024. So one more node, double the number of states…
  10. In the paper we control a PBN of 20 nodes, that is about a MILLION states!
  11. We have also worked with a PBN with 28 nodes since submitting this paper, which takes us closer to 270M states! So the point here is that although there are some elegant techniques around, the PTM is intractable for larger networks. See this paper here from 2007 … So… can we work without the PTM?
  12. YES WE CAN ! We formulate the problem as one of reward maximization and use RL ---- this is what the paper is about. I try to give you a quick tour of how we do this.
  13. Very briefly, in RL the agent learns by interaction with the environment. The agent's goal is to maximise the expected cumulative reward. The key word here is cumulative. So having been presented with a state by the env, the Agent selects an action and that choice not only determines the reward it receives at the next step, but also affects the next state it is presented with... and by virtue of that, also the set of actions it can take at that next step. So selecting an action is important, what policy the agent follows, as we say... this may be random or more sophisticated.
  14. The environment is often modelled as Markov Decision Process. -The binary states of the PBN become the states in the MDP -There are N+1 actions at each state: the option for flipping each of the N nodes plus doing nothing -There is the probability of function selection at each node, at each time step
  15. As far as the agent is concerned we equip it in our work with a Q-Learning algorithm. This is a Temporal-Difference Learning method ....so it combines learning from experience (so sampling like MC) with bootstrapping (like DP). Importantly Q-Learning is model-free, so no knowledge of the dynamics is required. So we don’t need the PTM (CLICK!), which as we saw earlier is intractable for larger networks. Also it is off-policy and I will come back to this later. The idea behind Temporal-Difference (TD) Learning methods is to learn directly from the environment but update estimates based on other learned estimates, without waiting for the end of the episode (bootstrap – like DP)
  16. As you probably suspect it is important to be able to calculate the expected reward of each action (as that can be used in selecting actions at each step). This formula here is key for improving the estimates. I will not go into detail, this is pretty standard in RL. So this is the expected reward at t+1 having taken a at t. This is the expected value of the best action available from that next state (target for the update). And if I take out the current estimate, this gives me the error in the estimate. And α is the “learning rate” which says how much I should take this error into account when updating to the new estimate …
  17. So that formula tells us how to improve our estimated Q values.... But how do we get to the true Q values?
  18. The answer is… Approximate iteratively. In this work, we use and epsilon-greedy policy so with prob epsilon we choose an action randomly with prob 1- epsilon we choose greedily – CLICK so we go for the action with the greatest expected reward Important note: Q converges to Q* with this policy
  19. So we know how to get the true values. PAUSE But there is some computational cost associated with storing these, especially in large state spaces.
  20. We turn to Deep RL to address this. We use a function approximator to learn a parameterised form This Deep Q Net (DQN) is trained by minimizing a sequence of loss functions L_i you see here…
  21. Now TD-Learning is often susceptible to overestimating….
  22. To address this, and I won’t go into much detail here, we use a __separate__ network for the target during training, And we use another, a second network, the so-called “policy DQN”, for updating the parameters of the “target DQN” [CLICK] every k time steps. So we effectively use a Double DQN (DDQN) in our control method. -- The parameters from this second DQN, the so-called “policy DQN”, are used to update the parameters of the “target DQN”
  23. When it comes to rewards, we assign negative rewards for actions that do not lead to the desired attractor and positive for those that do. We also take into account the internal dynamics, the natural tendency of a PBN to gravitate towards an attractor, which might not be the desired one, so we penalise more the actions that lead to a non-target attractor.
  24. Before we talk results, last point on the method… Training from successive experiences is susceptible to strong correlations in the data.
  25. To break up such correlations we sample from a batch of experiences, at each t, to update the network parameters. CLICK Size of the sample varies depending on the size of the PBN.
  26. The particular flavour of PER we use in this work is Proportional, as compared to rank-based for example, and we also use importance weighting .
  27. ... RESULTS !
  28. We have applied our method to various networks. HEADLINE NEWS: our DRL method leads to successful control, over 99% of the time, from any initial state! Some highlights: CLICK On the synthetic PBN20 (just over 1M states), the target attractor occurs 1/10K times (!), it would take 6,500 random interventions to reach that attractor; in comparison, our DRL method needs under 100 interventions to take us there (in fact, the success rate only drops to 99% if we limit the agent to 15 interventions) On the PBN of a real GRN, from the well-studied Melanoma gene expression dataset, we get a 100% success rate when allowing up to 10 interventions; if we limit to 7, success drops by only a fraction to 99.72% So we are quite pleased with the results.
  29. That made us think: Can we do larger networks? Can we do the other type of control? I refer to the problem that other work on control has been looking at: play with some subset of genes (CLICK pirin, in the studies I reference here), to fix the state of another gene (CLICK WNT5A OFF here)
  30. We applied our method to this kind of control on the 7-node PBN from the melanoma GRN We did it on the 28 node PBN -- which is the largest PBN addressed in existing literature. We get favourable results! Recently, we have been successful in controlling a 70-node PBN from this Melanoma dataset. So these developments make us hopeful we can control larger networks!!
  31. That's all from me! I hope you found at least some parts of this talk interesting. Please do get in touch if you are working on something similar. Thank you.