Interpretable	Deep	Learning	
for	Healthcare
Edward	Choi	(mp2893@gatech.edu)
Jimeng Sun	(jsun@cc.gatech.edu)
SunLab (sunlab.org)
Index
• Healthcare	&	Machine	Learning
• Sequence	Prediction	with	RNN
• Attention	mechanism &	interpretable	prediction
• Proposed	model:	RETAIN
• Experiments	&	results
• Conclusion
2
Healthcare	
&	
Machine	Learning
SunLab &	Healthcare
• SunLab &	Collaborators
Provider, Government, University, Company
4
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
5
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
6
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
7
[Figure: prediction setup along a time axis: observation window, diagnosis date, index date, prediction window]
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
8
[Figure: a patient timeline grouped into Visit 1, Visit 2, and Visit 3, containing codes such as Cough, Fever, Chill, Pneumonia, Chest X-ray, Tylenol, and IV fluid]
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
9
Recurrent	Neural	Network	(RNN)
Sequence	Prediction	
with	RNN
Sequence	Prediction	- NLP
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Is	the	given	sentence	positive	or	negative	?
• “Justice”	“League”	“is”	“as”	“impressive”	“as”	“a”	“preschool”	“Christmas”	“play”
• Each	word	is	a	symbol
• Outcome:	0,	1	(binary)
• The	sentence	is	either	positive	or	negative.
11
Sequence	Prediction	- EHR
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Given	a	diagnosis	history,	will	the	patient	have	heart	failure?
• Hypertension,	Hypertension,	Diabetes,	CKD,	CKD,	Diabetes,	MI
• Each	diagnosis	is	a	symbol
• Outcome:	0, 1	(binary)
• Either	you	have	HF,	or	you	don’t
12
What	is	sequence	prediction?
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Where	is	the	boundary	between	exons	and	introns	in	the	DNA	
sequence?
• What	is	the	French	translation	of	the	given	English	sentence?
• Given	a	diagnosis	history,	what	will	he/she	have	in	the	next	visit?
13
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
• “justice league is as impressive as a preschool christmas play”
x = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, …] (a bag-of-words vector with 1M elements, one for each word; the 1s mark the words that appear, e.g. “justice” and “preschool”)
14
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x) (transform x for an easier prediction)
15
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
Output	y
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x) (transform x for an easier prediction)
y = σ(wo^T h) (generate an outcome between 0.0 and 1.0)
16
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
Output	y
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x) (transform x for an easier prediction)
y = σ(wo^T h) (generate an outcome between 0.0 and 1.0)
17
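To make the two formulas above concrete, here is a minimal NumPy sketch of the MLP classifier's forward pass. The vocabulary size, hidden size, and random weights are illustrative assumptions (the slides assume roughly 1M words); in practice the weights would be learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, hidden_size = 10_000, 128      # toy sizes; the slides assume ~1M words
W_h = 0.01 * np.random.randn(vocab_size, hidden_size)   # hidden-layer weights
w_o = np.random.randn(hidden_size)                       # output weights

def mlp_predict(x):
    """x: bag-of-words vector of length vocab_size (1 for each word that appears)."""
    h = sigmoid(W_h.T @ x)    # h = σ(Wh^T x): transform x for an easier prediction
    y = sigmoid(w_o @ h)      # y = σ(wo^T h): outcome between 0.0 and 1.0
    return y

# Toy usage: a sentence containing words 3 and 7 of the vocabulary
x = np.zeros(vocab_size)
x[[3, 7]] = 1.0
print(mlp_predict(x))
```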
Sequence	prediction	with	RNN
• Now	let’s	use	Recurrent	Neural	Network	(RNN)
• Same	sentiment	classification	(positive	or	negative?)
Hidden	Layer	h1
h1 = σ(Wi^T x1)
x1 (a vector with 1M elements; only the element for “justice” is 1)
18
x1 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …] (only the “justice” element is 1)
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
[Figure: RNN unrolled over x1 (“Justice”) and x2 (“League”), producing h1 and h2]
h2 = σ(Wh^T h1 + Wi^T x2)
19
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
[Figure: RNN unrolled over x1 … x10 (“Justice”, “League”, …, “Christmas”, “play”), producing h1 … h10]
h10 = σ(Wh^T h9 + Wi^T x10)
20
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
Output: y = σ(wo^T h10)
Outcome 0.0 ~ 1.0
21
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
Output: y = σ(wo^T h10)
Outcome 0.0 ~ 1.0
22
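A matching sketch of the RNN version: the same sigmoid read-out, but the hidden state is updated word by word so the order of the sentence matters. The sizes, the plain sigmoid recurrence, and the random weights are again illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, hidden_size = 10_000, 64
W_i = 0.01 * np.random.randn(vocab_size, hidden_size)   # input-to-hidden weights
W_h = 0.01 * np.random.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
w_o = np.random.randn(hidden_size)                       # output weights

def rnn_predict(xs):
    """xs: list of one-hot word vectors in sentence order."""
    h = np.zeros(hidden_size)
    for x in xs:
        h = sigmoid(W_h.T @ h + W_i.T @ x)   # h_t = σ(Wh^T h_{t-1} + Wi^T x_t)
    return sigmoid(w_o @ h)                   # y = σ(wo^T h_T): read out from the last step
```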
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
23
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
• Outcome	0.9
• Was	it	because	of	“Justice”?
• Was	it	because	of	“impressive”?
• Was	it	because	of	“Christmas”?
24
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
[Figure: unrolled RNN over “Justice” … “play”; all inputs are accumulated into the final hidden state h10]
25
Attention	mechanism
&
Interpretable	Prediction
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
27
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
[Figure: RNN hidden states h1 … h10 over the sentence “Justice” … “play”]
28
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
[Figure: attention weights α1, α2, …, α9, α10 placed over the hidden states h1 … h10]
α1 + α2 + ⋯ + α10 = 1
c = α1 h1 + α2 h2 + ⋯ + α10 h10
29
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
[Figure: the context vector c, built from h1 … h10 with attention weights α1 … α10, is fed to the output layer]
y = σ(wo^T c)
30
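A minimal sketch of this attention read-out over the RNN hidden states. The scoring function that produces the α's (a single learned vector here) and the sizes are illustrative assumptions; the appendix slides describe one concrete choice (a small MLP followed by a Softmax).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

hidden_size = 64
w_a = np.random.randn(hidden_size)   # assumed scoring vector: score_t = w_a^T h_t
w_o = np.random.randn(hidden_size)   # output weights

def attention_predict(hs):
    """hs: array of shape (T, hidden_size) holding h_1 ... h_T from the RNN."""
    alphas = softmax(hs @ w_a)        # α_1 ... α_T, non-negative and summing to 1
    c = alphas @ hs                   # c = Σ_t α_t h_t: explicit combination of all steps
    y = 1.0 / (1.0 + np.exp(-(w_o @ c)))   # y = σ(wo^T c)
    return y, alphas                  # the α's tell us which word mattered most / least
```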
Attention	models
• Attention,	what	is	it	good	for?
31
Attention	models
• Attention,	what	is	it	good	for?
• c is	an	explicit	combination	of	all	past	information
• α1, α2, ⋯, α10 denote the usefulness of each word
• We can tell which word contributed the most/least to the outcome
[Figure: context vector c with attention weights α1 … α10]
32
Attention	models
• Attention,	what	is	it	good	for?
• Now c is an explicit combination of all past information
• α1, α2, ⋯, α10 denote the usefulness of each word
• We can tell which word contributed the most/least to the outcome
• The attentions α_i are generated using an MLP
[Figure: context vector c with attention weights α1 … α10]
33
Attention	Example
• English-French	translation
• Bahdanau,	Cho,	Bengio 2014
[Figure 3 from Bahdanau et al. 2014: four sample English-French alignments; each pixel shows the attention weight α_ij between a source word and a target word, in grayscale]
34
RETAIN:	Interpretable	Sequence	
Prediction	for	Healthcare	
(NIPS	2016)
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
36
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: codes laid out along a time axis: Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin]
37
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: codes laid out along a time axis: Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin]
38
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: the same codes grouped into Visit 1, Visit 2, and Visit 3: Cough, Fever, Chill, Pneumonia, Chest X-ray, Tylenol, IV fluid]
39
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (i.e.	visit)
x1 = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, …] (first-visit vector with 40K elements, one for each medical code; the 1s mark codes such as cough, fever, tylenol, pneumonia)
40
[Figure: Visit 1 containing Cough, Fever, Tylenol, IV fluid]
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Input	Layer	x1
Embedding	Layer	v1
x1 (a	multi-hot	vector	with	40K	elements.	One	for	each	code)
v1 = tanh(Wv^T x1) (transform x1 to a compact representation)
41
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Input	Layer	x1
Embedding	Layer	v1
x1 (a	multi-hot	vector	with	40K	elements.	One	for	each	code)
v1 = tanh(Wv^T x1) (transform x1 to a compact representation)
Hidden	Layer	h1
h1 = σ(Wi^T v1)
42
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
x1
v1
Hidden	Layer	h1
x2
v2
Hidden	Layer	h2
h2 = σ(Wh^T h1 + Wi^T v2)
43
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
x1
v1
Hidden	Layer	h1
x2
v2
Hidden	Layer	h2
xT
vT
Hidden	Layer	hT
hT = σ(Wh^T hT-1 + Wi^T vT)
44
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Hidden	Layer	hT
Output
y = σ(wo^T hT)
Outcome	0.0	~	1.0
45
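A minimal sketch of this straightforward visit-level RNN: multi-hot code vectors are embedded, fed through a recurrent hidden state, and read out as a risk score. The code-set size, embedding size, plain sigmoid recurrence, and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

num_codes, emb_size, hidden_size = 40_000, 128, 128
W_v = 0.01 * np.random.randn(num_codes, emb_size)    # code-to-embedding weights
W_i = 0.01 * np.random.randn(emb_size, hidden_size)  # embedding-to-hidden weights
W_h = 0.01 * np.random.randn(hidden_size, hidden_size)
w_o = np.random.randn(hidden_size)

def predict_outcome(visits):
    """visits: list of multi-hot vectors x_1 ... x_T, one per visit."""
    h = np.zeros(hidden_size)
    for x in visits:
        v = np.tanh(W_v.T @ x)                 # v_t = tanh(Wv^T x_t): compact visit embedding
        h = sigmoid(W_h.T @ h + W_i.T @ v)     # h_t = σ(Wh^T h_{t-1} + Wi^T v_t)
    return sigmoid(w_o @ h)                    # y = σ(wo^T h_T): outcome between 0.0 and 1.0
```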
RETAIN:	Motivation
• Which	visit	contributes	more	to	the	final	prediction?
[Figure: RNN over visits x1, x2, …, xT with embeddings v1 … vT and hidden layers h1 … hT]
46
RETAIN:	Motivation
• Within	a	single	visit,	which	code	contributes	more	to	the	prediction?
[Figure: visit embeddings v1 … vT and hidden layers h1 … hT; one visit's multi-hot vector x = [1, 0, 0, 1, …] with codes cough, fever, tylenol, pneumonia]
47
RETAIN:	Design	Choices
48
[Figure: standard attention model (left) vs RETAIN (right)]
RETAIN:	Design	Choices
49
[Figure: standard attention model (left) vs RETAIN (right)]
Standard attention model: an RNN embeds the visits. RETAIN: an MLP embeds the visits.
RETAIN:	Design	Choices
50
[Figure: standard attention model (left) vs RETAIN (right)]
Standard attention model: an MLP generates the attention for the visits. RETAIN: an RNN generates the attention for the visits.
RETAIN:	Design	Choices
51
[Figure: standard attention model (left) vs RETAIN (right)]
RETAIN only: another RNN generates attentions for the codes within each visit.
RETAIN:	Design	Choices
52
[Figure: standard attention model (left) vs RETAIN (right)]
In both models, the visits are combined for the prediction.
RETAIN:	Design	Choices
53
[Figure: standard attention model (left) vs RETAIN (right)]
Standard attention model: less interpretable end-to-end. RETAIN: interpretable end-to-end.
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture; given the input sequence x1, …, xi, the model predicts the label]
54
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture; given the input sequence x1, …, xi, the model predicts the label]
55
(Paper excerpt) In NMT attention, to find the j-th word in the target language, attentions α_i^j are generated for each word in the source sentence, and the context vector c_j = Σ_i α_i^j h_i is used to predict the j-th target word. In general, the attention mechanism allows the model to focus on specific words in the given sentence when generating each word in the target language. In this work, a temporal attention mechanism is defined to provide interpretability for healthcare: doctors generally pay attention to specific clinical information and specific timing when reviewing EHR data, and RETAIN mimics this practice.

2.2 Reverse Time Attention Model RETAIN
Figure 2 shows the high-level overview of the model. One key idea is to delegate a considerable portion of the prediction responsibility to the attention-weight generation process, because RNNs become hard to interpret once the recurrent weights feed past information into the hidden layer. To preserve both the visit-level and the variable-level (individual coordinates of x_i) influence, a simple linear embedding of the input vector is used:
v_i = W_emb x_i   (Step 1)
where v_i ∈ R^m denotes the embedding of the input vector x_i ∈ R^r, m the embedding dimension, and W_emb ∈ R^{m×r} (also written E) the embedding matrix to learn. A more sophisticated but still interpretable representation, such as an MLP, could also be used.
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
Two sets of attention weights are used: the scalars α1, …, αi are visit-level attention weights that govern the influence of each visit embedding v1, …, vi, and the vectors β1, …, βi are variable-level attention weights that focus on each coordinate of the visit embeddings v_{1,1}, …, v_{i,m}. Two RNNs, RNN_α and RNN_β, generate the α's and β's:
gi, gi-1, …, g1 = RNN_α(vi, vi-1, …, v1)
ej = w_α^T gj + b_α, for j = 1, …, i
α1, α2, …, αi = Softmax(e1, e2, …, ei)   (Step 2)
hi, hi-1, …, h1 = RNN_β(vi, vi-1, …, v1)
βj = tanh(W_β hj + b_β), for j = 1, …, i   (Step 3)
where gi ∈ R^p is the hidden layer of RNN_α at time step i, hi ∈ R^q the hidden layer of RNN_β, and w_α ∈ R^p, b_α ∈ R, W_β ∈ R^{m×q}, b_β ∈ R^m are the parameters to learn. The hyperparameters p and q determine the hidden layer sizes of RNN_α and RNN_β.
56
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
(Steps 2 and 3, repeated from the previous slide:)
gi, gi-1, …, g1 = RNN_α(vi, vi-1, …, v1)
ej = w_α^T gj + b_α, for j = 1, …, i
α1, α2, …, αi = Softmax(e1, e2, …, ei)   (Step 2)
hi, hi-1, …, h1 = RNN_β(vi, vi-1, …, v1)
βj = tanh(W_β hj + b_β), for j = 1, …, i   (Step 3)
57
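A hedged NumPy sketch of Steps 1 to 3, not the authors' released code: it uses plain tanh RNN cells where the paper leaves the RNN variant open, toy dimensions, and random placeholder weights (in practice all parameters are learned).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

r, m, p, q = 40_000, 128, 128, 128            # num codes, embedding size, RNN_alpha / RNN_beta sizes
W_emb = 0.01 * np.random.randn(m, r)          # embedding matrix (Step 1)
w_alpha, b_alpha = np.random.randn(p), 0.0    # parameters for the visit-level scores e_j
W_beta, b_beta = 0.01 * np.random.randn(m, q), np.zeros(m)

def reverse_rnn(V, hidden_size):
    """Stand-in for RNN_alpha / RNN_beta: a plain tanh RNN run in reverse time order.
    Weights are random placeholders for illustration; they would normally be learned."""
    W_in = 0.01 * np.random.randn(V.shape[1], hidden_size)
    W_rec = 0.01 * np.random.randn(hidden_size, hidden_size)
    h, hs = np.zeros(hidden_size), []
    for v in V[::-1]:                          # most recent visit first, as RETAIN prescribes
        h = np.tanh(W_in.T @ v + W_rec.T @ h)
        hs.append(h)
    return np.array(hs[::-1])                  # re-align so row j corresponds to visit j

def retain_attention(X):
    """X: (num_visits, r) multi-hot visit vectors. Returns embeddings, alphas, betas."""
    V = X @ W_emb.T                            # Step 1: v_j = W_emb x_j
    G = reverse_rnn(V, p)                      # hidden states g_j of RNN_alpha
    H = reverse_rnn(V, q)                      # hidden states h_j of RNN_beta
    alphas = softmax(G @ w_alpha + b_alpha)    # Step 2: scalar visit-level attention weights
    betas = np.tanh(H @ W_beta.T + b_beta)     # Step 3: vector variable-level attention weights
    return V, alphas, betas
```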
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
When doctors review past records, they typically study the patient's most recent records first and go back in time. Computationally, running the RNNs in reverse time order has advantages as well: it allows the e's and β's to change dynamically when making predictions at different time steps i = 1, 2, …, T, ensures the attention vectors differ at each time step, and makes the attention generation process more stable.
The context vector c_i for a patient up to the i-th visit is
c_i = Σ_{j=1}^{i} α_j β_j ⊙ v_j   (Step 4)
where ⊙ denotes element-wise multiplication. The context vector c_i ∈ R^m is used to predict the true label y_i ∈ {0,1}^s:
ŷ_i = Softmax(W c_i + b)   (Step 5)
where W ∈ R^{s×m} and b ∈ R^s are parameters to learn. Training minimizes the cross-entropy loss
L(x_1, …, x_T) = −(1/N) Σ_{n=1}^{N} (1/T^(n)) Σ_{i=1}^{T^(n)} ( y_i^T log(ŷ_i) + (1 − y_i)^T log(1 − ŷ_i) )   (1)
where the cross-entropy errors are summed over all dimensions of ŷ_i. For real-valued outputs y_i ∈ R^s, the cross-entropy in Eq. (1) can be replaced by, for example, mean squared error.
Overall, this attention mechanism can be viewed as the inverted architecture of the standard attention mechanism for NLP, where the words are encoded using an RNN and the attention weights are generated using an MLP.
58
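Continuing the sketch above (it reuses retain_attention, softmax, and m from the previous block), Steps 4 and 5 combine the doubly weighted visit embeddings into a context vector and turn it into a prediction; the output size s is an assumed toy value.

```python
s = 2                                         # e.g. heart failure vs no heart failure
W_out, b_out = 0.01 * np.random.randn(s, m), np.zeros(s)

def retain_predict(X):
    V, alphas, betas = retain_attention(X)
    c = (alphas[:, None] * betas * V).sum(axis=0)   # Step 4: c_i = Σ_j α_j β_j ⊙ v_j
    return softmax(W_out @ c + b_out)               # Step 5: ŷ_i = Softmax(W c_i + b)
```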
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
c_i = Σ_{j=1}^{i} α_j β_j ⊙ v_j   (Step 4)
ŷ_i = Softmax(W c_i + b)   (Step 5)
with the cross-entropy loss of Eq. (1) as on the previous slide. Overall, this attention mechanism can be viewed as the inverted architecture of the standard NLP attention mechanism, where words are encoded with an RNN and attention weights are generated with an MLP. RETAIN instead uses an MLP-style linear map to embed the visit information for easy interpretation, and uses RNNs to generate the two sets of attention weights, recovering the sequential information while mimicking the behavior of physicians.
59
RETAIN:	Calculating	the	Contributions
(Paper excerpt) To interpret the end-to-end behavior of RETAIN, keep the α and β values fixed (they represent the attention of doctors) and analyze how the probability of each label y_{i,1}, …, y_{i,s} changes with each original input x_{1,1}, …, x_{1,r}, …, x_{i,1}, …, x_{i,r}. The x_{j,k} that leads to the largest change in y_{i,d} is the input variable with the highest contribution. More formally, given the sequence x_1, …, x_i, the probability of the output vector y_i ∈ {0,1}^s is
p(y_i | x_1, …, x_i) = p(y_i | c_i) = Softmax(W c_i + b)   (2)
where c_i ∈ R^m denotes the context vector. By Step 4, c_i is the sum of the visit embeddings v_1, …, v_i weighted by the attentions α and β, so Eq. (2) can be rewritten as
p(y_i | x_1, …, x_i) = Softmax( W ( Σ_{j=1}^{i} α_j β_j ⊙ v_j ) + b )   (3)
Using the fact that the visit embedding v_j is the sum of the columns of W_emb weighted by the elements of x_j, Eq. (3) can be rewritten as
p(y_i | x_1, …, x_i) = Softmax( W ( Σ_{j=1}^{i} α_j β_j ⊙ Σ_{k=1}^{r} x_{j,k} W_emb[:, k] ) + b )
                    = Softmax( Σ_{j=1}^{i} Σ_{k=1}^{r} x_{j,k} α_j W ( β_j ⊙ W_emb[:, k] ) + b )   (4)
where x_{j,k} is the k-th element of the input vector x_j. Eq. (4) shows that the likelihood of y_i can be completely deconstructed down to the variables at each input x_1, …, x_i.
60
RETAIN:	Calculating	the	Contributions
(The same paper excerpt as the surrounding slides: the reverse-time rationale, Steps 4 and 5 with the cross-entropy loss (1), and Eqs. (2) to (4) decomposing the prediction probability down to the individual input variables.)
61
RETAIN:	Calculating	the	Contributions
(Eqs. (2) to (4) again: the prediction probability is decomposed down to the individual input variables x_{j,k} of every visit.)
62
From Eq. (4), the contribution ω of the k-th variable of the input x_j at time step j ≤ i to predicting y_i is
ω(y_i, x_{j,k}) = α_j W( β_j ⊙ e_{:,k} ) · x_{j,k}   (5)
where α_j W( β_j ⊙ e_{:,k} ) is the contribution coefficient and x_{j,k} is the input value. Here e_{:,k} = W_emb[:, k], the k-th column of the embedding matrix E.
RETAIN:	Calculating	the	Contributions
The highlighted term α_j W( β_j ⊙ e_{:,k} ) sits inside the iteration over k in Eq. (4): it is computed for every code k of every visit j.
63
In Eq. (4), x_{j,k} and α_j are scalars, so they can be pulled to the front of each term: x_{j,k} α_j W( β_j ⊙ W_emb[:, k] ).
RETAIN:	Calculating	the	Contributions
(Eq. (4) again, with the scalar factors x_{j,k} and α_j moved to the front of each term.)
64
Putting these together gives the contribution of the k-th variable of the input x_j at time step j ≤ i, for predicting y_i:
ω(y_i, x_{j,k}) = α_j W( β_j ⊙ W_emb[:, k] ) · x_{j,k}   (5)
RETAIN:	Calculating	the	Contributions
Eq. (5) gives the contribution of the k-th code in the j-th visit to the current prediction.
65
(The index i is omitted from α_j and β_j in Eq. (5); as described in Section 2.2, the attention weights are regenerated for each prediction time step i.)
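Continuing the same sketch (retain_attention, W_emb, W_out, r, and s come from the earlier blocks), here is a hedged illustration of Eq. (5): the contribution of every code in every visit to the current prediction, computed by fixing the learned α's and β's.

```python
def retain_contributions(X):
    """Returns omega of shape (num_visits, r, s): omega[j, k] is the contribution
    of code k in visit j to each output class, following Eq. (5)."""
    V, alphas, betas = retain_attention(X)
    omega = np.zeros((X.shape[0], r, s))
    for j in range(X.shape[0]):
        for k in np.nonzero(X[j])[0]:          # only codes present in visit j can contribute
            coeff = alphas[j] * (W_out @ (betas[j] * W_emb[:, k]))  # α_j W(β_j ⊙ W_emb[:, k])
            omega[j, k] = coeff * X[j, k]      # multiply by the input value x_{j,k}
    return omega
```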
Experiments	&	Results
Heart	Failure	(HF)	Prediction
• Objective
• Given	a	patient	record,	predict	whether	he/she	will	be	diagnosed	with	HF	in	the	
future
• 34K	patients	from	Sutter	PAMF
• 4K	cases,	30K	controls
• Use 18 months of history before the HF diagnosis
• Number of medical codes (617 in total)
• 283 diagnosis codes
• 96 medication codes
• 238 procedure codes
67
Heart	failure	prediction
• Performance	measure
• Area	under	the	ROC	curve	(AUC)
• Competing	models
• Logistic	regression
• Aggregate	all	past	codes	into	a	fixed-size	vector.	Feed	it	to	LR
• MLP
• Aggregate	all	past	codes	into	a	fixed-size	vector.	Feed	it	to	MLP
• Two-layer	RNN
• Visits	are	fed	to	the	RNN,	whose	hidden	layers	are	fed	to	another	RNN.
• RNN+attention (Bahdanau et	al.	2014)
• Visits	are	fed	to	RNN.	Visit-level	attentions	are	generated	by	MLP
• RETAIN
68
Heart	failure	prediction
Models AUC Training time	/	epoch Test	time	for	5K	patients
Logistic	Regression 0.7900	± 0.0111	 0.15s 0.11s
MLP 0.8256	± 0.0096 0.25s 0.11s
Two-layer	RNN 0.8706	± 0.0080	 10.3s 0.57s
RNN+attention 0.8624	± 0.0079 6.7s 0.48s
RETAIN 0.8705	± 0.0081 10.8s 0.63s
• RETAIN	as	accurate	as	RNN
• Requires	similar	training	time	&	test	time
• RETAIN	is	interpretable!
• RNN	is	a	blackbox
69
RETAIN	visualization
• Demo
70
Conclusion
• RETAIN:	interpretable	prediction	framework
• As	accurate	as	RNN
• Interpretable	prediction
• Predictions	can	be	explained
• Can	be	extended	to	general	prognosis
• What are the likely diseases he/she will have in the future?
• Can	be	used	for	any	sequences	with	the	two-layer	structure
• E.g.	online	shopping
71
Interpretable	Deep	Learning	
for	Healthcare
Edward	Choi	(mp2893@gatech.edu)
Jimeng Sun	(jsun@cc.gatech.edu)
SunLab (sunlab.org)
How to generate the attentions α_i?
• Use	another	neural	network	model
Input	Layer	x
Hidden	Layer	h
Output	y
x
h = σ(Wh^T x)
y = wo^T h (outcome from −∞ to +∞)
Let’s	call	this	function	y=a(x)
73
How to generate the attentions α_i?
• Use	function	a(x)	for	each	word:	Justice,	League,	…,	Christmas,	play
• Feed	the	scores	y1,	y2,	…,	y10 into	the	Softmax function
[Figure: the score function applied to each word: a(x1) = y1 for “Justice”, a(x2) = y2 for “League”, …, a(x9) = y9 for “Christmas”, a(x10) = y10 for “play”]
α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j)
74
How to generate the attentions α_i?
• Use	function	a(x)	for	each	word:	Justice,	League,	…,	Christmas,	play
• Feed	the	scores	y1,	y2,	…,	y10 into	the	Softmax function
[Figure: the score function applied to each word: a(x1) = y1 for “Justice”, a(x2) = y2 for “League”, …, a(x9) = y9 for “Christmas”, a(x10) = y10 for “play”]
α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j)
The Softmax function ensures the α_i's sum to 1
Return
75
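A minimal sketch of this attention-score generator: the small network a(x) produces an unbounded score for each word, and the Softmax turns the scores into α's that sum to 1. The two-layer form, sizes, and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

vocab_size, hidden_size = 10_000, 32
W_h = 0.01 * np.random.randn(vocab_size, hidden_size)
w_o = np.random.randn(hidden_size)

def a(x):
    """Score function a(x): maps a word vector to an unbounded scalar y."""
    h = 1.0 / (1.0 + np.exp(-(W_h.T @ x)))    # h = σ(Wh^T x)
    return w_o @ h                             # y = wo^T h, anywhere in (−∞, +∞)

def attention_weights(xs):
    """xs: list of one-hot word vectors. Returns α_1 ... α_T, which sum to 1."""
    scores = np.array([a(x) for x in xs])
    return softmax(scores)                     # α_i = exp(y_i) / Σ_j exp(y_j)
```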
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
NAVER Engineering
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
NAVER Engineering
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
NAVER Engineering
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
NAVER Engineering
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
NAVER Engineering
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
NAVER Engineering
 

More from NAVER Engineering (20)

React vac pattern
React vac patternReact vac pattern
React vac pattern
 
디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX
 
진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)
 
서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트
 
BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호
 
이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라
 
날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기
 
쏘카프레임 구축 배경과 과정
 쏘카프레임 구축 배경과 과정 쏘카프레임 구축 배경과 과정
쏘카프레임 구축 배경과 과정
 
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
 
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
 
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
 
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
 
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
 
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
 
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
 

Recently uploaded

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

Interpretable deep learning for healthcare

  • 23. Limitation of RNN • Transparency • The RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome 23
  • 24. Limitation of RNN • Transparency • The RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome • Outcome 0.9 • Was it because of “Justice”? • Was it because of “impressive”? • Was it because of “Christmas”? 24
  • 25. Limitation of RNN • Transparency • The RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome • [Diagram: hidden states h1, h2, ..., h9, h10, one per word “Justice”, “League”, ..., “Christmas”, “play”; all inputs are accumulated in the final hidden state] 25
  • 27. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions 27
  • 28. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions • [Diagram: hidden states h1, ..., h10, one per word] 28
  • 29. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions • Attention weights: α1 + α2 + ⋯ + α10 = 1 • Context vector: c = α1 h1 + α2 h2 + ⋯ + α10 h10 29
  • 30. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions • Output: y = σ(wo^T c) 30
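To make the weighted-sum idea on slides 29–30 concrete, here is a minimal sketch (not the speakers’ code) of attention pooling over the ten hidden states. It assumes h1...h10 have already been produced by an RNN; the scoring vector, sizes, and random inputs are illustrative only.

```python
import torch

torch.manual_seed(0)

# Toy setup: 10 words, RNN hidden size 64 (both are illustrative choices).
num_words, hidden_size = 10, 64
h = torch.randn(num_words, hidden_size)          # stand-ins for h_1 ... h_10 from an RNN

# Score each hidden state, then normalize with softmax so the
# attention weights alpha_1 ... alpha_10 are positive and sum to 1.
w_score = torch.randn(hidden_size)               # learnable in a real model
scores = h @ w_score                             # y_1 ... y_10
alpha = torch.softmax(scores, dim=0)             # alpha_1 + ... + alpha_10 = 1

# Context vector: an explicit weighted combination of all hidden states.
c = (alpha.unsqueeze(1) * h).sum(dim=0)          # c = sum_t alpha_t * h_t

# Final prediction y = sigmoid(w_o^T c), e.g. positive vs. negative sentiment.
w_o = torch.randn(hidden_size)
y = torch.sigmoid(w_o @ c)
print(alpha.sum().item(), y.item())
```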
  • 32. Attention models • Attention, what is it good for? • c is an explicit combination of all past information • α1, α2, ⋯, α10 denote the usefulness of each word • We can tell which word contributed the most/least to the outcome 32
  • 33. Attention models • Attention, what is it good for? • Now c is an explicit combination of all past information • α1, α2, ⋯, α10 denote the usefulness of each word • We can tell which word contributed the most/least to the outcome • The attentions αi are generated using an MLP 33
  • 34. Attention Example • English-French translation • Bahdanau, Cho, Bengio 2014 • [Figure 3 from the paper: sample alignments found by RNNsearch; rows and columns correspond to the words of the source and generated sentences, and each pixel shows the attention weight αij in grayscale (0: black, 1: white)] 34
  • 36. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... 36
  • 37. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Benzonatate Fever Pneumonia Amoxicillin Chest X-ray Time 37
  • 38. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Benzonatate Fever Pneumonia Amoxicillin Chest X-ray Time 38
  • 39. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Visit 1 Fever Fever Visit 2 Chill Fever Visit 3 Pneumonia Chest X-ray Tylenol IV fluid 39
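A minimal illustration of the two-level structure on slides 36–39: a patient record is a sequence of visits, and each visit is an unordered set of medical codes. The patient below is hypothetical and mirrors the Visit 1 / Visit 2 / Visit 3 example.

```python
# Hypothetical patient record: a sequence of visits, each visit an unordered
# set of medical codes (diagnoses, medications, procedures).
patient = [
    ["Cough", "Fever", "Tylenol", "IV fluid"],     # Visit 1
    ["Fever", "Chill"],                             # Visit 2
    ["Fever", "Pneumonia", "Chest X-ray"],          # Visit 3
]

# A word sequence is flat; an EHR sequence has this extra visit level,
# so the model must handle a *set* of codes at every timestep.
for t, visit in enumerate(patient, start=1):
    print(f"Visit {t}: {sorted(visit)}")
```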
  • 40. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (i.e. visit) 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, … x1 (First visit vector with 40K elements. One for each medical code) cough fever tylenol pneumonia 40 Cough Visit 1 Fever Tylenol IV fluid
  • 41. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (aka visit) Input Layer x1 Embedding Layer v1 x1 (a multi-hot vector with 40K elements. One for each code) v1 = tanh(Wv^T x1) (Transform x to a compact representation) 41
  • 42. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (aka visit) Input Layer x1 Embedding Layer v1 x1 (a multi-hot vector with 40K elements. One for each code) v1 = tanh(Wv^T x1) (Transform x to a compact representation) Hidden Layer h1 h1 = σ(Wi^T v1) 42
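A sketch of the “straightforward RNN” on slides 40–42, using a toy vocabulary instead of 40K codes: each visit becomes a multi-hot vector x_t, which is embedded as v_t = tanh(Wv^T x_t) and fed to a recurrent layer. The GRU, the sizes, and the final prediction head are illustrative choices, not the speakers’ exact configuration.

```python
import torch
import torch.nn as nn

vocab = ["Cough", "Fever", "Chill", "Pneumonia", "Chest X-ray", "Tylenol", "IV fluid"]
code_index = {c: i for i, c in enumerate(vocab)}

patient = [
    ["Cough", "Fever", "Tylenol", "IV fluid"],   # Visit 1
    ["Fever", "Chill"],                           # Visit 2
    ["Fever", "Pneumonia", "Chest X-ray"],        # Visit 3
]

# Multi-hot visit vectors x_t: one element per medical code, several 1s per visit.
x = torch.zeros(len(patient), len(vocab))
for t, visit in enumerate(patient):
    for code in visit:
        x[t, code_index[code]] = 1.0

emb_dim, hid_dim = 8, 16                          # illustrative sizes
W_v = nn.Linear(len(vocab), emb_dim, bias=False)  # visit embedding: v_t = tanh(W_v x_t)
rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
out = nn.Linear(hid_dim, 1)

v = torch.tanh(W_v(x)).unsqueeze(0)               # (1, num_visits, emb_dim)
h, _ = rnn(v)                                     # hidden states h_1 ... h_T
y = torch.sigmoid(out(h[:, -1]))                  # predict (e.g. HF) from the last hidden state
print(y.item())
```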
  • 48. RETAIN: Design Choices • [Diagram comparing the standard attention model with RETAIN: input x, visit embedding v, attention weights, element-wise multiplication ⊙, and context vector c] 48
  • 49. RETAIN: Design Choices • Standard attention model: an RNN embeds the visits • RETAIN: an MLP embeds the visits 49
  • 50. RETAIN: Design Choices • Standard attention model: an MLP generates the attentions for the visits • RETAIN: an RNN generates the attentions for the visits 50
  • 51. RETAIN: Design Choices • RETAIN: another RNN generates the attentions for the codes within each visit 51
  • 52. RETAIN: Design Choices • In both models, the attended visits are combined for prediction 52
  • 53. RETAIN: Design Choices • Standard attention model: less interpretable end-to-end • RETAIN: interpretable end-to-end 53
  • 54. RETAIN: Model Architecture • [Figure 2: Unfolded view of RETAIN’s architecture. Given the input sequence x1, ..., xi, the model embeds each visit, runs two RNNs (RNNα and RNNβ) over the embeddings in reverse time order to produce the attention weights, and combines the attended visit embeddings into a context vector for prediction.] 54
  • 55. RETAIN: Model Architecture • Attention lets the model focus on specific words (or visits) when making a prediction; RETAIN defines a temporal attention mechanism that mimics how doctors review EHR data, paying attention to specific clinical information at specific times. • One key idea is to delegate much of the prediction responsibility to the attention-generation process, so the visit embedding itself stays simple and interpretable. • Step 1 (visit embedding): vi = Wemb xi, where xi ∈ R^r is the multi-hot visit vector, vi ∈ R^m its embedding, and Wemb ∈ R^(m×r) the embedding matrix to learn (a more sophisticated but still interpretable embedding such as an MLP could also be used). 55
  • 56. RETAIN: Model Architecture • Two sets of attention weights are used: the scalars α1, ..., αi (visit-level) govern the influence of each visit embedding v1, ..., vi, and the vectors β1, ..., βi (variable-level) focus on the individual coordinates of each visit embedding. • Step 2: gi, gi−1, ..., g1 = RNNα(vi, vi−1, ..., v1); ej = wα^T gj + bα for j = 1, ..., i; α1, α2, ..., αi = Softmax(e1, e2, ..., ei). 56
  • 57. RETAIN: Model Architecture • Step 3: hi, hi−1, ..., h1 = RNNβ(vi, vi−1, ..., v1); βj = tanh(Wβ hj + bβ) for j = 1, ..., i. • Here gi ∈ R^p and hi ∈ R^q are the hidden layers of RNNα and RNNβ at time step i; wα ∈ R^p, bα ∈ R, Wβ ∈ R^(m×q) and bβ ∈ R^m are parameters to learn, and the hyperparameters p and q set the hidden-layer sizes of RNNα and RNNβ. 57
  • 58. RETAIN: Model Architecture • Both RNNs run in reverse time order: doctors typically study the most recent records first and then go back in time, and computationally the reverse order lets the attention values change dynamically across prediction time steps and keeps the attention generation stable. • Step 4 (context vector): ci = Σ_{j=1..i} αj βj ⊙ vj, where ⊙ denotes element-wise multiplication. 58
  • 59. RETAIN: Model Architecture • Step 5 (prediction): ŷi = Softmax(W ci + b), with W ∈ R^(s×m) and b ∈ R^s parameters to learn; training minimizes the cross-entropy loss L(x1, ..., xT) = −(1/N) Σ_n (1/T^(n)) Σ_i [ yi^T log(ŷi) + (1 − yi)^T log(1 − ŷi) ] (for real-valued outputs, mean squared error can be used instead). • Overall, this is the inverted architecture of the standard NLP attention mechanism: an MLP embeds the visits to preserve interpretation, and RNNs generate the two sets of attention weights. 59
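The five steps above can be sketched directly. The following is a simplified single-patient forward pass, not the authors’ released implementation: RNNα and RNNβ are GRUs run over the reversed visit sequence, the prediction is made only at the final visit, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class RetainSketch(nn.Module):
    """Simplified RETAIN forward pass for one patient (Steps 1-5 above)."""

    def __init__(self, num_codes, emb_dim=16, alpha_dim=16, beta_dim=16, num_labels=2):
        super().__init__()
        self.emb = nn.Linear(num_codes, emb_dim, bias=False)   # Step 1: v_j = W_emb x_j
        self.rnn_alpha = nn.GRU(emb_dim, alpha_dim, batch_first=True)
        self.rnn_beta = nn.GRU(emb_dim, beta_dim, batch_first=True)
        self.w_alpha = nn.Linear(alpha_dim, 1)                  # e_j = w_a^T g_j + b_a
        self.w_beta = nn.Linear(beta_dim, emb_dim)              # beta_j = tanh(W_b h_j + b_b)
        self.out = nn.Linear(emb_dim, num_labels)               # Step 5

    def forward(self, x):                                       # x: (num_visits, num_codes)
        v = self.emb(x).unsqueeze(0)                            # (1, T, emb_dim)
        v_rev = torch.flip(v, dims=[1])                         # reverse time order
        g, _ = self.rnn_alpha(v_rev)
        h, _ = self.rnn_beta(v_rev)
        g = torch.flip(g, dims=[1])                             # back to forward order
        h = torch.flip(h, dims=[1])
        alpha = torch.softmax(self.w_alpha(g), dim=1)           # Step 2: visit-level weights
        beta = torch.tanh(self.w_beta(h))                       # Step 3: variable-level weights
        c = (alpha * beta * v).sum(dim=1)                       # Step 4: context vector
        return torch.softmax(self.out(c), dim=-1)               # Step 5: prediction

x = (torch.rand(5, 100) < 0.05).float()                         # toy data: 5 visits, 100 codes
print(RetainSketch(num_codes=100)(x))
```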
  • 60. RETAIN: Calculating the Contributions e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the 60
  • 61. RETAIN: Calculating the Contributions the past records, they typically study the patient’s most recent records fi117 Computationally, running the RNN in reversed time order has several advan118 time order allows us to generate e’s and ’s that dynamically change th119 predictions at different time steps i = 1, 2, . . . , T. It ensures that the attentio120 at each timestamp and makes the attention generation process computation121 We generate the context vector ci for a patient up to the i-th visit as follow122 ci = iX j=1 ↵j j vj, where denotes element-wise multiplication. We use the context vector ci123 label yi 2 {0, 1}s as follows,124 byi = Softmax(Wci + b), where W 2 Rs⇥m and b 2 Rs are parameters to learn. We use the cross125 classification loss as follows,126 L(x1, . . . , xT ) = 1 N NX n=1 1 T(n) T (n) X i=1 ⇣ y> i log(byi) + (1 yi)> where we sum the cross entropy errors from all dimensions of byi. In ca127 yi 2 Rs , we can change the cross-entropy in Eq. (1) to for example mean128 Overall, our attention mechanism can be viewed as the inverted architecture129 mechanism for NLP [2] where the words are encoded using RNN and gene130 e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the 61 n terms of the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the argest change in yi,d will be the input variable with highest contribution. More formally, given the equence x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which an be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) where ci 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings 1, . . . , vi weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) Using the fact that the visit embedding vi is the sum of the columns of Wemb weighted by each lement of xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) where xj,k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the kelihood of yi can be completely deconstructed down to the variables at each input x1, . . . , xi. herefore we can calculate the contribution ! 
of the k-th variable of the input xj at time step j  i,
  • 62. RETAIN: Calculating the Contributions n terms of the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the argest change in yi,d will be the input variable with highest contribution. More formally, given the equence x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which an be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) where ci 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings 1, . . . , vi weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) Using the fact that the visit embedding vi is the sum of the columns of Wemb weighted by each lement of xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) where xj,k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the kelihood of yi can be completely deconstructed down to the variables at each input x1, . . . , xi. herefore we can calculate the contribution ! of the k-th variable of the input xj at time step j  i, e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the 62 2.2 Reverse Time Attention Model RETAIN Figure 2 shows the high-level overview of our model. One key idea is the prediction responsibility to the attention weights generation proces due to the recurrent weights feeding past information to the hidden l visit-level and the variable-level (individual coordinates of xi) influen input vector xi. That is, we define vi = Exi, where vi 2 Rm denotes the embedding of the input vector xi 2 Rr , m E 2 Rm⇥r the embedding matrix to learn. We can easily choose a mor representation such as multilayer perceptron (MLP) [13, 28] which has in EHR data [10]. We use two sets of weights for the visit-level attention and the vari scalars ↵1, . . . , ↵i are the visit-level attention weights that govern th v1, . . . , vi. The vectors 1, . . . , i are the variable-level attention weig the visit embedding v1,1, v1,2, . . . , v1,m, . . . , vi,1, vi,2, . . . , vi,m. We use two RNNs, RNN↵ and RNN , to separately generate ↵’s a predict the probability of the output vector yi 2 {0, 1}s , which can be expressed as follows p(yi|x1, . . 
. , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of E weighted by each element of xi, be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,ke:,k ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j e:,k ⌘ + b ◆ (4) is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the likelihood of completely deconstructed down to the variables at each input x1, . . . , xi. Therefore we can calculate bution ! of the k-th variable of the input xj at time step j  i, for predicting yi as follows, !(yi, xj,k) = ↵jW( j e:,k) | {z } Contribution coefficient xj,k |{z} Input value , (5) k-th column of E
  • 63. RETAIN: Calculating the Contributions e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the Inside the iteration over k 63 e in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the largest change in e the input variable with highest contribution. More formally, given the sequence x1, . . . , xi, we are predict the probability of the output vector yi 2 {0, 1}s , which can be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of E weighted by each element of xi, n be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,ke:,k ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j e:,k ⌘ + b ◆ (4) is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the likelihood of completely deconstructed down to the variables at each input x1, . . . , xi. Therefore we can calculate bution ! of the k-th variable of the input xj at time step j  i, for predicting yi as follows, !(y , x ) = ↵ W( e ) x , (5) n terms of the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the argest change in yi,d will be the input variable with highest contribution. More formally, given the equence x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which an be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) where ci 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings 1, . . . , vi weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) Using the fact that the visit embedding vi is the sum of the columns of Wemb weighted by each lement of xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) where xj,k is the k-th element of the input vector xj. 
Eq (4) tells us that the calculation of the kelihood of yi can be completely deconstructed down to the variables at each input x1, . . . , xi. herefore we can calculate the contribution ! of the k-th variable of the input xj at time step j  i, predict the probability of the output vector yi 2 {0, 1}s , which can be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of E weighted by each element of xi, be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,ke:,k ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j e:,k ⌘ + b ◆ (4) is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the likelihood of completely deconstructed down to the variables at each input x1, . . . , xi. Therefore we can calculate bution ! of the k-th variable of the input xj at time step j  i, for predicting yi as follows, !(yi, xj,k) = ↵jW( j e:,k) | {z } Contribution coefficient xj,k |{z} Input value , (5) Scalars in the front
  • 64. RETAIN: Calculating the Contributions e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the 64 1 i predict the probability of the output vector yi 2 {0, 1}s , which can be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of E weighted by each element of xi, n be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,ke:,k ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j e:,k ⌘ + b ◆ (4) is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the likelihood of completely deconstructed down to the variables at each input x1, . . . , xi. Therefore we can calculate bution ! of the k-th variable of the input xj at time step j  i, for predicting yi as follows, !(yi, xj,k) = ↵jW( j e:,k) | {z } Contribution coefficient xj,k |{z} Input value , (5) n terms of the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the argest change in yi,d will be the input variable with highest contribution. More formally, given the equence x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which an be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) where ci 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings 1, . . . , vi weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) Using the fact that the visit embedding vi is the sum of the columns of Wemb weighted by each lement of xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) where xj,k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the kelihood of yi can be completely deconstructed down to the variables at each input x1, . . . , xi. herefore we can calculate the contribution ! 
The contribution of the k-th variable of the input x_j at time step j ≤ i is derived as follows. Given the sequence x_1, …, x_i, we predict the probability of the output vector y_i ∈ {0, 1}^s, which can be expressed as

p(y_i | x_1, …, x_i) = p(y_i | c_i) = Softmax(W c_i + b)   (2)

where c_i ∈ R^m denotes the context vector. According to Step 4, c_i is the sum of the visit embeddings v_1, …, v_i weighted by the attentions α's and β's, so Eq (2) can be rewritten as

p(y_i | x_1, …, x_i) = p(y_i | c_i) = Softmax( W (Σ_{j=1}^{i} α_j β_j ⊙ v_j) + b )   (3)

Using the fact that the visit embedding v_j is the sum of the columns of W_emb weighted by the elements of x_j, Eq (3) can be rewritten as

p(y_i | x_1, …, x_i) = Softmax( W (Σ_{j=1}^{i} α_j β_j ⊙ Σ_{k=1}^{r} x_{j,k} W_emb[:, k]) + b )
                    = Softmax( Σ_{j=1}^{i} Σ_{k=1}^{r} x_{j,k} α_j W(β_j ⊙ W_emb[:, k]) + b )   (4)

where x_{j,k} is the k-th element of the input vector x_j. Eq (4) tells us that the likelihood of y_i can be completely deconstructed down to the individual variables of each input x_1, …, x_i. Therefore the contribution ω of the k-th variable of the input x_j at time step j ≤ i for predicting y_i is

ω(y_i, x_{j,k}) = α_j W(β_j ⊙ W_emb[:, k]) × x_{j,k}   (5)

where α_j W(β_j ⊙ W_emb[:, k]) is the contribution coefficient and x_{j,k} is the input value (the index i is omitted from α_j and β_j for brevity).
  • 65. RETAIN: Calculating the Contributions • This derivation gives a method to interpret the end-to-end behavior of RETAIN. Keeping the α and β values fixed as the attention of the doctor, we analyze how the probability of each label y_{i,1}, …, y_{i,s} changes with a change in the original input x_{1,1}, …, x_{1,r}, …, x_{i,1}, …, x_{i,r}; the x_{j,k} that leads to the largest change in y_{i,d} is the input variable with the highest contribution. • The term ω(y_i, x_{j,k}) in Eq (5) is exactly the contribution of the k-th code in the j-th visit (a small NumPy sketch of this computation follows below). 65
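The contribution formula in Eq (5) can be evaluated directly from the learned parameters. Below is a minimal NumPy sketch of that computation, assuming the attentions α and β, the embedding matrix W_emb, and the output weights W have already been obtained from a trained RETAIN model; the function name and array shapes are illustrative, not part of the original slides.

```python
import numpy as np

def retain_contributions(x, alpha, beta, W_emb, W):
    """Per-variable contributions omega(y_i, x_{j,k}) from Eq (5).

    x      : (i, r)  multi-hot inputs for visits 1..i
    alpha  : (i,)    visit-level attention weights
    beta   : (i, m)  variable-level attention weights
    W_emb  : (m, r)  embedding matrix (column k embeds medical code k)
    W      : (s, m)  output projection
    returns: (i, r, s) contribution of code k in visit j to each label
    """
    i, r = x.shape
    s = W.shape[0]
    omega = np.zeros((i, r, s))
    for j in range(i):
        for k in range(r):
            if x[j, k] == 0:                                   # absent codes contribute nothing
                continue
            coeff = alpha[j] * (W @ (beta[j] * W_emb[:, k]))   # contribution coefficient
            omega[j, k] = coeff * x[j, k]                      # times the input value
    return omega
```

As a sanity check, summing ω over all visits and codes, adding the bias b, and applying Softmax should reproduce the prediction of Eq (4).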
  • 67. Heart Failure (HF) Prediction • Objective • Given a patient record, predict whether he/she will be diagnosed with HF in the future • 34K patients from Sutter PAMF • 4K cases, 30K controls • Use the 18-month history before being diagnosed with HF (one way to build this window is sketched below) • Number of medical codes • 283 diagnosis codes • 96 medication codes • 238 procedure codes • 617 medical codes in total 67
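The slide only states that the 18-month history before the HF diagnosis is used (controls would need an analogous index date, which the slide does not specify). The pandas sketch below shows one plausible way to carve out such an observation window; all column names (patient_id, visit_date, code, index_date) are hypothetical.

```python
import pandas as pd

def build_observation_window(visits: pd.DataFrame, index_dates: pd.DataFrame,
                             months: int = 18) -> pd.DataFrame:
    """Keep only the codes that fall inside the `months`-long window before each
    patient's index date.

    visits      : one row per (patient_id, visit_date, code)
    index_dates : one row per (patient_id, index_date), e.g. the HF diagnosis date
    """
    df = visits.merge(index_dates, on="patient_id")            # attach the index date
    window_start = df["index_date"] - pd.DateOffset(months=months)
    in_window = (df["visit_date"] >= window_start) & (df["visit_date"] < df["index_date"])
    return df.loc[in_window, ["patient_id", "visit_date", "code"]]
```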
  • 68. Heart failure prediction • Performance measure • Area under the ROC curve (AUC) • Competing models • Logistic regression • Aggregate all past codes into a fixed-size vector and feed it to LR (a sketch of this baseline follows below) • MLP • Aggregate all past codes into a fixed-size vector and feed it to the MLP • Two-layer RNN • Visits are fed to an RNN, whose hidden states are fed to another RNN • RNN+attention (Bahdanau et al. 2014) • Visits are fed to an RNN; visit-level attentions are generated by an MLP • RETAIN 68
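For reference, here is a minimal sketch of the simplest baseline above: all codes from a patient's past visits are collapsed into one fixed-size count vector (617 dimensions for the cohort on the previous slide) and fed to scikit-learn's LogisticRegression. The toy patient data and helper name are placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aggregate_codes(patient_visits, num_codes=617):
    """Collapse a patient's visit sequence into one fixed-size count vector.

    patient_visits: list of visits, each visit a list of integer code indices.
    """
    x = np.zeros(num_codes)
    for visit in patient_visits:
        for code in visit:
            x[code] += 1
    return x

# Toy placeholders: two patients, label 1 = HF case, 0 = control
patients = [[[0, 5], [5, 12]], [[3], [3, 7, 20]]]
labels = [1, 0]

X = np.stack([aggregate_codes(p) for p in patients])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1])   # predicted HF risk per patient
```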
  • 69. Heart failure prediction

Models              | AUC             | Training time / epoch | Test time for 5K patients
Logistic Regression | 0.7900 ± 0.0111 | 0.15s                 | 0.11s
MLP                 | 0.8256 ± 0.0096 | 0.25s                 | 0.11s
Two-layer RNN       | 0.8706 ± 0.0080 | 10.3s                 | 0.57s
RNN+attention       | 0.8624 ± 0.0079 | 6.7s                  | 0.48s
RETAIN              | 0.8705 ± 0.0081 | 10.8s                 | 0.63s

• RETAIN is as accurate as the RNN • Requires similar training time & test time • RETAIN is interpretable! • RNN is a black box 69
  • 71. Conclusion • RETAIN: an interpretable prediction framework • As accurate as an RNN • Interpretable prediction • Predictions can be explained • Can be extended to general prognosis • What are the likely diseases he/she will have in the future? • Can be used for any sequence with the two-layer structure • E.g., online shopping 71
  • 74. How to generate the attentions α_i? • Use a scoring function a(x) for each word: Justice, League, …, Christmas, play • Feed the scores y_1, y_2, …, y_10 into the Softmax function • [Figure: each word x_1 (Justice), x_2 (League), …, x_9 (Christmas), x_10 (play) is passed through a(x) to produce its score y_1, …, y_10] • α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j) 74
  • 75. How to generate the attentions α_i? • Use a scoring function a(x) for each word: Justice, League, …, Christmas, play • Feed the scores y_1, y_2, …, y_10 into the Softmax function • [Figure: same diagram as the previous slide] • α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j) • The Softmax function ensures the α_i's sum to 1 (a small sketch follows below) Return 75
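A minimal NumPy sketch of this attention computation is below. The slides do not specify the form of a(x); the dot product between a word embedding and a learned scoring vector used here is only one common, hypothetical choice, and the embeddings are random stand-ins.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax, so the attention weights sum to 1."""
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(10, 8))   # x_1 ... x_10 ("Justice" ... "play")
u = rng.normal(size=8)                       # stand-in for the learned scoring vector in a(x)

scores = word_embeddings @ u                 # y_1 ... y_10
alpha = softmax(scores)                      # attention weights alpha_1 ... alpha_10
print(alpha, alpha.sum())                    # the weights sum to 1
```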