
# (DL Hacks paper reading) Deep Kalman Filters


### (DL Hacks paper reading) Deep Kalman Filters

1. Deep Kalman Filters
2. Paper: Rahul G. Krishnan, Uri Shalit, David Sontag. arXiv, 2015/11/16.
   - Deep Learning + Kalman Filter
   - VAE
   - Deep…
5. Kalman Filters
   - Assume a sequence of unobserved latent variables $z_1, \dots, z_T \in R^s$. For each $z_t$ we have a corresponding observation $x_t \in R^d$ and a corresponding action $u_t \in R^c$, which is also observed.
   - In the medical domain, $z_t$ might denote the true state of a patient, the observations $x_t$ indicate known diagnoses and lab test results, and the actions $u_t$ correspond to prescribed medications and medical procedures which aim to change the state of the patient.
   - The classical Kalman filter models the observed sequence $x_1, \dots, x_T$ as
     $z_t = G_t z_{t-1} + B_t u_{t-1} + \epsilon_t$ (action-transition), $\quad x_t = F_t z_t + \eta_t$ (observation),
     where $\epsilon_t \sim N(0, \Sigma_t)$ and $\eta_t \sim N(0, \Gamma_t)$ are zero-mean i.i.d. normal random variables, with covariance matrices that may vary with $t$.
   - The latent space evolves linearly through the state-transition matrix $G_t \in R^{s \times s}$; the control signal $u_{t-1}$ enters as an additive linear term $B_t u_{t-1}$, where $B_t \in R^{s \times c}$ is the control-input model; the observations are generated linearly from the latent state via the observation matrix $F_t \in R^{d \times s}$.
   - The paper replaces all of these linear transformations with non-linear transformations parameterized by neural nets. The upshot is that the non-linearity makes learning much more challenging, because the posterior $p(z_1, \dots, z_T \mid x_1, \dots, x_T, u_1, \dots, u_T)$ becomes intractable to compute.
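To make the classical linear-Gaussian model concrete, here is a minimal simulation sketch in Python/numpy. The dimensions, matrices, and noise scales are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Classical Kalman filter generative process: z_t = G z_{t-1} + B u_{t-1} + eps_t,
# x_t = F z_t + eta_t, with Gaussian noise. All numbers below are made up.
s, d, c, T = 4, 6, 2, 10                    # latent, observation, action dims; sequence length
rng = np.random.default_rng(0)

G = 0.9 * np.eye(s)                         # state-transition matrix G_t (held fixed over t here)
B = 0.1 * rng.normal(size=(s, c))           # control-input matrix B_t
F = rng.normal(size=(d, s))                 # observation matrix F_t
Sigma = 0.01 * np.eye(s)                    # transition noise covariance
Gamma = 0.10 * np.eye(d)                    # observation noise covariance

z = np.zeros(s)
for t in range(T):
    u_prev = rng.normal(size=c)                                           # observed action u_{t-1}
    z = G @ z + B @ u_prev + rng.multivariate_normal(np.zeros(s), Sigma)  # action-transition
    x = F @ z + rng.multivariate_normal(np.zeros(d), Gamma)               # observation
    print(t, np.round(x, 2))
```

The deep Kalman filter keeps exactly this structure but replaces the linear maps with neural networks, which is what makes the posterior intractable.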
6. Stochastic Backpropagation
   - SGVB / Variational autoencoders (VAE).
   - To overcome the intractability of posterior inference, the paper uses recently introduced variational autoencoders (Rezende et al., 2014; Kingma & Welling, 2013) to optimize a variational lower bound on the model log-likelihood. The key technical innovation is the introduction of a recognition network, a neural network which approximates the intractable posterior.
   - Let $p(x, z) = p_0(z) p_\theta(x|z)$ be a generative model for the observations $x$, where $p_0(z)$ is a prior on $z$ and $p_\theta(x|z)$ is a generative model parameterized by $\theta$. The posterior $p_\theta(z|x)$ is typically intractable; using the variational principle we posit an approximate posterior $q_\phi(z|x)$, also called a recognition model, and obtain the following lower bound on the marginal likelihood (Eq. 1):
     $\log p_\theta(x) = \log \int_z \frac{q_\phi(z|x)}{q_\phi(z|x)} p_\theta(x|z) p_0(z)\, dz \ge \int_z q_\phi(z|x) \log \frac{p_\theta(x|z) p_0(z)}{q_\phi(z|x)}\, dz = E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \| p_0(z)) = \mathcal{L}(x; (\theta, \phi))$,
     where the inequality is by Jensen's inequality.
   - Variational autoencoders maximize this lower bound using a parametric model $q_\phi$ conditioned on the input. Rezende et al. (2014) and Kingma & Welling (2013) both suggest using a neural net to parameterize $q_\phi$, so that $\phi$ are the parameters of that net.
   - The challenge is that the lower bound (1) includes an expectation w.r.t. $q_\phi$, which implicitly depends on the network parameters $\phi$. This is overcome by stochastic backpropagation: assuming the latent state is normally distributed, $q_\phi(z|x) \sim N(\mu_\phi(x), \Sigma_\phi(x))$, a simple transformation (the reparameterization trick) gives Monte Carlo estimates of the gradients of $E_{q_\phi(z|x)}[\log p_\theta(x|z)]$ with respect to $\phi$. The KL term in (1) can be estimated similarly since it is also an expectation; if the prior $p_0(z)$ is also normally distributed, the KL and its gradients can be obtained analytically.
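A minimal PyTorch sketch of this lower bound and the reparameterization trick. The layer sizes, the Bernoulli likelihood for the decoder, and the standard-normal prior are assumptions made for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class RecognitionNet(nn.Module):
    """q_phi(z|x) = N(mu_phi(x), diag(exp(logvar_phi(x))))."""
    def __init__(self, x_dim=784, z_dim=20, h=200):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU())
        self.mu = nn.Linear(h, z_dim)
        self.logvar = nn.Linear(h, z_dim)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """p_theta(x|z), here a Bernoulli over pixels (an assumption)."""
    def __init__(self, x_dim=784, z_dim=20, h=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def forward(self, z):
        return self.net(z)                      # logits of p_theta(x|z)

def elbo(x, q_net, decoder):
    mu, logvar = q_net(x)
    eps = torch.randn_like(mu)
    z = mu + eps * torch.exp(0.5 * logvar)      # reparameterization: z = mu + sigma * eps
    log_px_z = torch.distributions.Bernoulli(logits=decoder(z)).log_prob(x).sum(-1)
    # KL(q_phi(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return (log_px_z - kl).mean()               # single-sample Monte Carlo estimate of L(x; (theta, phi))
```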
7. Counterfactual Estimation
   - Counterfactual estimation is the task of inferring how an outcome would have evolved under actions or conditions different from those actually observed, e.g. estimating a distribution of the form $p(x \mid do(u))$ rather than the observational $p(x \mid u)$.
   - In the medical setting this corresponds to asking what a patient's state would have been under a different treatment than the one actually prescribed.
8. Pearl (2009)
   - The do-operator distinguishes intervention from observation: $p(y \mid x) \rightarrow p(y \mid do(x))$.
   - Conditioning on $x$ only selects the cases in which $x$ happened to take that value; $do(x)$ forces $x$ to that value and removes the influence of its causes, so the two distributions generally differ when confounding is present.
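A toy numpy simulation (my own construction, not from the slides or the paper) of why $p(y \mid x)$ and $p(y \mid do(x))$ can disagree: a confounder $w$ drives both $x$ and $y$, while $x$ has no causal effect on $y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
w = rng.normal(size=n)                                 # confounder
x_obs = w + 0.1 * rng.normal(size=n)                   # observationally, x tracks w
y_obs = w + 0.0 * x_obs + 0.1 * rng.normal(size=n)     # x has no causal effect on y

# Conditioning: E[y | x ~ 1] is close to 1, because seeing x ~ 1 implies w ~ 1.
print("p(y | x=1):    ", y_obs[np.abs(x_obs - 1.0) < 0.05].mean())

# Intervening: do(x = 1) severs the w -> x edge, so y is unaffected.
x_do = np.ones(n)
y_do = w + 0.0 * x_do + 0.1 * rng.normal(size=n)
print("p(y | do(x=1)):", y_do.mean())
```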
9. Model
   - The goal is to fit a generative model to a sequence of observations and actions, motivated by the structure of patient health record data.
   - Assumptions: the observations come from a latent state which evolves over time; the observations are a noisy, non-linear function of this latent state; and the observed actions affect the latent state in a possibly non-linear manner.
   - Denote the sequence of observations $\vec{x} = (x_1, \dots, x_T)$ and actions $\vec{u} = (u_1, \dots, u_{T-1})$, with corresponding latent states $\vec{z} = (z_1, \dots, z_T)$. As previously, $x_t \in R^d$, $u_t \in R^c$, $z_t \in R^s$.
   - The generative model for the deep Kalman filter is then given by (Eq. 2):
     $z_1 \sim N(\mu_0, \Sigma_0)$
     $z_t \sim N(G_\alpha(z_{t-1}, u_{t-1}, \Delta_t),\ S_\beta(z_{t-1}, u_{t-1}, \Delta_t))$
     $x_t \sim \Pi(F_\kappa(z_t))$
   - That is, the distribution of the latent states is Normal, with a mean and covariance which are nonlinear functions of the previous latent state, the previous action, and the time difference $\Delta_t$.
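A rough PyTorch sketch of the generative model in Eq. (2). The layer sizes, tanh activations, the Bernoulli emission standing in for $\Pi$, and the standard-normal initial state are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Transition(nn.Module):
    """G_alpha, S_beta: mean and diagonal, log-parameterized covariance of p(z_t | z_{t-1}, u_{t-1}, dt)."""
    def __init__(self, z_dim, u_dim, h=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(z_dim + u_dim + 1, h), nn.Tanh())
        self.mu = nn.Linear(h, z_dim)
        self.logvar = nn.Linear(h, z_dim)

    def forward(self, z_prev, u_prev, dt):
        h = self.body(torch.cat([z_prev, u_prev, dt], dim=-1))
        return self.mu(h), self.logvar(h)

class Emission(nn.Module):
    """F_kappa: maps z_t to the parameters of Pi (here Bernoulli logits)."""
    def __init__(self, z_dim, x_dim, h=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h), nn.Tanh(), nn.Linear(h, x_dim))

    def forward(self, z):
        return self.net(z)

def sample_sequence(trans, emit, u_seq, z_dim):
    """Ancestral sampling: z_1 ~ N(0, I), x_1 ~ Pi(F(z_1)), then z_t from (z_{t-1}, u_{t-1})."""
    B = u_seq.shape[0]
    z = torch.randn(B, z_dim)
    xs = [torch.distributions.Bernoulli(logits=emit(z)).sample()]
    for t in range(u_seq.shape[1]):
        mu, logvar = trans(z, u_seq[:, t], torch.ones(B, 1))      # dt fixed to 1 for simplicity
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        xs.append(torch.distributions.Bernoulli(logits=emit(z)).sample())
    return torch.stack(xs, dim=1)                                  # (B, T, x_dim), T = u_seq.shape[1] + 1
```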
10. From VAE to DKF
   - A diagonal covariance matrix $S_\beta(\cdot)$ is used, with a log-parameterization, thus ensuring that the covariance matrix is positive-definite.
   - Eq. (2) subsumes a large family of linear and non-linear latent space models. By restricting the functional forms of $G_\alpha, S_\beta, F_\kappa$ we can train different kinds of Kalman filters within this framework: for example, setting $G_\alpha(z_{t-1}, u_{t-1}) = G_t z_{t-1} + B_t u_{t-1}$, $S_\beta = \Sigma_t$, $F_\kappa(z_t) = F_t z_t$, where $G_t, B_t, \Sigma_t, F_t$ are matrices, recovers the classical Kalman filter.
   - In the past, modifications to the Kalman filter typically introduced a new learning algorithm and heuristics to approximate the posterior more accurately. In contrast, within this framework any parametric differentiable function can be substituted for $G_\alpha, S_\beta, F_\kappa$, and any such model can be learned using backpropagation, as detailed in the next section.
   - Figure 1: (a) Learning static generative models (Variational Autoencoder). Solid lines denote the generative model $p_0(z) p_\theta(x|z)$, dashed lines the variational approximation $q_\phi(z|x)$ to the intractable posterior $p(z|x)$; the variational parameters $\phi$ are learned jointly with the generative model parameters $\theta$. (b) Learning in a Deep Kalman Filter: a parametric approximation to $p_\theta(\vec{z}|\vec{x})$, denoted $q_\phi(\vec{z}|\vec{x}, \vec{u})$, is used to perform inference during learning.
11. Learning using Stochastic Backpropagation
   - Maximizing a lower bound: we aim to fit the generative model parameters $\theta$ which maximize the conditional likelihood of the data given the external actions, i.e. $\max_\theta \log p_\theta(x_1, \dots, x_T \mid u_1, \dots, u_{T-1})$.
   - Using the variational principle, the lower bound (1) is extended to the temporal setting with the following factorization (Eq. 3):
     $q_\phi(\vec{z}|\vec{x}, \vec{u}) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}, x_t, \dots, x_T, \vec{u})$
     The variational approximation is conditioned not just on the inputs $\vec{x}$ but also on the actions $\vec{u}$; this structured factorization is motivated in Section 5.2 of the paper.
   - The KL term of the bound (1) has an analytic form only for the simplest transition models $G_\alpha, S_\beta$, and resorting to sampling for estimating its gradient results in very high variance. The Markov property of the model allows the KL term to be factorized in a way that gives more stable gradients (next slide).
   - Algorithm 1: Learning Deep Kalman Filters
     while notConverged():
       $\vec{x}$ ← sampleMiniBatch()
       1. $\hat{z} \sim q_\phi(\vec{z}|\vec{x}, \vec{u})$
       2. $\hat{x} \sim p_\theta(\vec{x}|\hat{z})$
       3. Compute $\nabla_\theta \mathcal{L}$ and $\nabla_\phi \mathcal{L}$ (differentiating (5))
       4. Update $\theta, \phi$ using ADAM
   - For the conditional log-likelihood we have (Eq. 4; see Supplemental Section A of the paper for a more detailed derivation):
     $\log p_\theta(\vec{x}|\vec{u}) \ge \int_{\vec{z}} q_\phi(\vec{z}|\vec{x}, \vec{u}) \log \frac{p_0(\vec{z}|\vec{u}) p_\theta(\vec{x}|\vec{z}, \vec{u})}{q_\phi(\vec{z}|\vec{x}, \vec{u})}\, d\vec{z} = E_{q_\phi(\vec{z}|\vec{x},\vec{u})}[\log p_\theta(\vec{x}|\vec{z}, \vec{u})] - KL(q_\phi(\vec{z}|\vec{x}, \vec{u}) \| p_0(\vec{z}|\vec{u}))$
     $= \sum_{t=1}^{T} E_{z_t \sim q_\phi(z_t|\vec{x},\vec{u})}[\log p_\theta(x_t|z_t, u_{t-1})] - KL(q_\phi(\vec{z}|\vec{x}, \vec{u}) \| p_0(\vec{z}|\vec{u})) = \mathcal{L}(x; (\theta, \phi))$,
     using $x_t \perp x_{\neg t} \mid \vec{z}$.
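A hedged sketch of Algorithm 1 as a PyTorch training loop. `model`, its `elbo(x, u)` method (evaluating Eq. (5) on a mini-batch), and the data loader interface are hypothetical names standing in for the quantities above.

```python
import torch

def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # the "update theta, phi using ADAM" step
    for _ in range(epochs):                             # stands in for "while notConverged()"
        for x, u in loader:                             # x: (B, T, d), u: (B, T-1, c) mini-batch
            loss = -model.elbo(x, u)                    # steps 1-3: sample z ~ q_phi, score under p_theta, evaluate L
            opt.zero_grad()
            loss.backward()                             # gradients w.r.t. theta and (via reparameterization) phi
            opt.step()                                  # step 4
```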
12. Factorizing the KL term
   - The KL divergence can be factorized as (Eq. 4):
     $KL(q_\phi(\vec{z}|\vec{x}, \vec{u}) \| p_0(\vec{z})) = \int_{z_1} \cdots \int_{z_T} q_\phi(z_1|\vec{x}, \vec{u}) \cdots q_\phi(z_T|z_{T-1}, \vec{x}, \vec{u}) \log \frac{p_0(z_1, \dots, z_T)}{q_\phi(z_1|\vec{x}, \vec{u}) \cdots q_\phi(z_T|z_{T-1}, \vec{x}, \vec{u})}\, d\vec{z}$
     $= KL(q_\phi(z_1|\vec{x}, \vec{u}) \| p_0(z_1)) + \sum_{t=2}^{T} E_{z_{t-1} \sim q_\phi(z_{t-1}|\vec{x},\vec{u})}\left[ KL(q_\phi(z_t|z_{t-1}, \vec{x}, \vec{u}) \| p_0(z_t|z_{t-1}, u_{t-1})) \right]$
     (using the factorization of $p(\vec{z})$).
   - This yields (Eq. 5):
     $\log p_\theta(\vec{x}|\vec{u}) \ge \mathcal{L}(x; (\theta, \phi)) = \sum_{t=1}^{T} E_{q_\phi(z_t|\vec{x},\vec{u})}[\log p_\theta(x_t|z_t)] - KL(q_\phi(z_1|\vec{x}, \vec{u}) \| p_0(z_1)) - \sum_{t=2}^{T} E_{q_\phi(z_{t-1}|\vec{x},\vec{u})}\left[ KL(q_\phi(z_t|z_{t-1}, \vec{x}, \vec{u}) \| p_0(z_t|z_{t-1}, u_{t-1})) \right]$
   - Equation (5) is differentiable in the parameters of the model $(\theta, \phi)$: backpropagation is applied for updating $\theta$, and stochastic backpropagation for estimating the gradient w.r.t. $\phi$ of the expectation terms.
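Each term in the factorized KL is a KL divergence between two diagonal Gaussians, which has a closed form. A small PyTorch sketch; the `(mu, logvar)` list interface and the standard-normal prior on $z_1$ are assumptions made for illustration.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, diag) || N(mu_p, diag) ), summed over the latent dimension."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p).pow(2)) / logvar_p.exp()
        - 1.0,
        dim=-1,
    )

def factorized_kl(q_params, prior_params):
    """q_params: [(mu_t, logvar_t)] for t = 1..T from the recognition model.
    prior_params: [(mu_t, logvar_t)] for t = 2..T from the transition model,
    evaluated at samples z_{t-1} ~ q_phi, matching the expectations in Eq. (4)."""
    mu1, lv1 = q_params[0]
    # KL against p_0(z_1), taken here as N(0, I) for simplicity
    kl = gaussian_kl(mu1, lv1, torch.zeros_like(mu1), torch.zeros_like(lv1))
    for (mu_q, lv_q), (mu_p, lv_p) in zip(q_params[1:], prior_params):
        kl = kl + gaussian_kl(mu_q, lv_q, mu_p, lv_p)
    return kl
```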
13. Recognition models
   - Universality of normalizing flows [Rezende+ 2015].
   - Choices for the recognition network $q_\phi$ (a q-BRNN-style sketch follows this slide):
     - q-INDEP: $q(z_t|x_t, u_t)$ parameterized by an MLP
     - q-LR: $q(z_t|x_{t-1}, x_t, x_{t+1}, u_{t-1}, u_t, u_{t+1})$ parameterized by an MLP
     - q-RNN: $q(z_t|x_1, \dots, x_t, u_1, \dots, u_t)$ parameterized by an RNN
     - q-BRNN: $q(z_t|x_1, \dots, x_T, u_1, \dots, u_T)$ parameterized by a bi-directional RNN
   - The experimental section compares the performance of these four models on a challenging sequence reconstruction task. An interesting question is whether the Markov properties of the model enable better design of approximations to the posterior.
   - Theorem 5.1: for the graphical model depicted in Figure 1b, the posterior factorizes as
     $p(\vec{z}|\vec{x}, \vec{u}) = p(z_1|\vec{x}, \vec{u}) \prod_{t=2}^{T} p(z_t|z_{t-1}, x_t, \dots, x_T, u_{t-1}, \dots, u_{T-1})$
   - Proof sketch: the independence statements implied by the generative model give $p(\vec{z}|\vec{x}, \vec{u}) = p(z_1|\vec{x}, \vec{u}) \prod_{t=2}^{T} p(z_t|z_{t-1}, \vec{x}, \vec{u})$; then $z_t \perp x_1, \dots, x_{t-1} \mid z_{t-1}$ and $z_t \perp u_1, \dots, u_{t-2} \mid z_{t-1}$ yield the result.
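As referenced above, here is a minimal PyTorch sketch of a q-BRNN-style recognition network. The GRU cell, hidden size, and the simple concatenation of $(x_t, u_t)$ are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class QBRNN(nn.Module):
    """Bi-directional RNN producing per-timestep variational parameters for q(z_t | x_1..x_T, u_1..u_T)."""
    def __init__(self, x_dim, u_dim, z_dim, h=128):
        super().__init__()
        self.rnn = nn.GRU(x_dim + u_dim, h, batch_first=True, bidirectional=True)
        self.mu = nn.Linear(2 * h, z_dim)
        self.logvar = nn.Linear(2 * h, z_dim)

    def forward(self, x, u):
        # x: (B, T, x_dim), u: (B, T, u_dim); every output step sees the whole sequence
        h, _ = self.rnn(torch.cat([x, u], dim=-1))   # (B, T, 2h): forward + backward summaries
        return self.mu(h), self.logvar(h)            # means and log-variances for every z_t
```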
14. Healing MNIST
   - A synthetic dataset built from MNIST digits: each training sequence shows a digit being rotated over time (the rotation acts as the observed action), with 20% bit-flip noise applied to the pixels.
   - Small Healing MNIST: 40,000 sequences of length 5.
   - Large Healing MNIST: 140,000 sequences of length 5.
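A rough sketch of how a Healing-MNIST-style sequence might be generated with numpy/scipy. Only the 20% bit-flip rate comes from the slide; the rotation angle per step and the binarization threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def healing_sequence(digit, angle_per_step=15.0, length=5, flip_prob=0.2, seed=0):
    """digit: a (28, 28) float array in [0, 1]. Returns a (length, 28, 28) binary sequence."""
    rng = np.random.default_rng(seed)
    frames = []
    for t in range(length):
        frame = rotate(digit, angle_per_step * t, reshape=False) > 0.5   # rotation acts as the "action"
        flips = rng.random(frame.shape) < flip_prob                      # 20% bit-flip noise
        frames.append(np.logical_xor(frame, flips).astype(np.float32))
    return np.stack(frames)
```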
15. Small Healing MNIST
   - Test log-likelihood over training epochs for the four recognition models (q-BRNN, q-RNN, q-LR, q-INDEP); see Figure 3(b) of the paper.
   - The slide also diagrams the recognition architectures, contrasting the forward and backward passes of q-BRNN with the purely local q-INDEP and q-LR.
16. Small Healing MNIST
   - Figure 3: (a) Mean probabilities sampled from models trained with different $q_\phi$, under a constant, large rotation applied to the right. (b) Test log-likelihood under different recognition models.
17. Large Healing MNIST
   - Reconstruction during training (Figure 2a): pairs of Training Sequences (TS) and Mean probabilities of Reconstructions (R), shown as alternating TS / R rows.
18. Large Healing MNIST
   - Samples under different, constant rotation actions (Figure 2b): large rotation right, large rotation left, no rotation, small rotation right, small rotation left.
19. Large Healing MNIST
   - Figure 2: (a) Pairs of Training Sequences (TS) and Mean Probabilities of Reconstructions (R). (b) Mean probabilities sampled from the model under different, constant rotations. (c) Counterfactual reasoning: variants of the digits 5 and 1 not present in the training set are reconstructed, with (bottom) and without (top) bit-flip noise. A sequence of 1 timestep is inferred and the reconstructions from the posterior are displayed; the latent state is then kept, and forward sampling and reconstruction are performed from the generative model under a constant right rotation.
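A hedged sketch of this counterfactual procedure. `recognition`, `transition`, and `emission` are assumed to behave like the hypothetical modules sketched earlier (returning means/log-variances and Bernoulli logits); none of this is the paper's exact code.

```python
import torch

@torch.no_grad()
def counterfactual_rollout(recognition, transition, emission, x1, u1, actions):
    """Infer z_1 from a single observed frame, then roll the generative model
    forward under a chosen action sequence (e.g. a constant right rotation)."""
    mu, logvar = recognition(x1.unsqueeze(1), u1.unsqueeze(1))        # q(z_1 | x_1, u_1), sequence length 1
    z = mu[:, 0] + torch.randn_like(mu[:, 0]) * torch.exp(0.5 * logvar[:, 0])
    frames = [torch.sigmoid(emission(z))]                             # mean probabilities, as shown in the figure
    for u in actions:                                                 # actions never observed with this digit
        mu_t, logvar_t = transition(z, u, torch.ones(z.shape[0], 1))
        z = mu_t + torch.randn_like(mu_t) * torch.exp(0.5 * logvar_t)
        frames.append(torch.sigmoid(emission(z)))
    return torch.stack(frames, dim=1)
```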
20. Medical data
   - Electronic health records of 8,000 diabetic and pre-diabetic patients.
   - Observations include lab test results such as A1c and glucose levels; actions are the prescribed medications (e.g. anti-diabetic drugs such as Metformin).
21. Lab indicator variable
   - The generative model is augmented with a lab indicator variable $x^{ind}_t$ alongside the lab observation $x_t$.
   - Counterfactual (do-style) queries are answered by clamping the indicator rather than inferring it.
   - Figure 5: (a) Generative model with the lab indicator variable, focusing on time step $t$. (b) For counterfactual inference we set $x^{ind}_t$ to 1, implementing Pearl's do-operator.
22. Model variants on the medical data
   - Recognition model: q-BRNN, trained on the 8,000 patients.
   - L denotes a linear relationship, NL a non-linear one (a neural network).
   - Em(ission) refers to $F_\kappa$; Tr(ansition) refers to $G_\alpha$.
   - Test log-likelihood over training epochs is shown for the four combinations (Em: L, Tr: NL), (Em: NL, Tr: NL), (Em: L, Tr: L), (Em: NL, Tr: L); this corresponds to Figure 4(a) of the paper.
23. Counterfactual inference on the medical data
   - 800 patients; inferred trajectories are compared with and without anti-diabetic drugs, using glucose and A1c as outcomes.
   - Figure 4: Results of disease progression modeling for 8000 diabetic and pre-diabetic patients. (a) Test log-likelihood under different model variants: Em(ission) denotes $F_\kappa$ and Tr(ansition) denotes $G_\alpha$, under Linear (L) and Non-Linear (NL) functions; a fixed diagonal $S_\beta$ is learned. (b) Proportion of patients inferred to have high (top quantile) glucose level with and without anti-diabetic drugs, starting from the time of the first Metformin prescription. (c) Proportion of patients inferred to have high (above 8%) A1c level with and without anti-diabetic drugs, starting from the time of the first Metformin prescription.
25. References
   - http://www.slideshare.net/takehikoihayashi/ss-13441401
   - Judea Pearl's do-operator: http://takehiko-i-hayashi.hatenablog.com/entry/20111222/1324487579