Interpretable	Deep	Learning	
for	Healthcare
Edward	Choi	(mp2893@gatech.edu)
Jimeng Sun	(jsun@cc.gatech.edu)
SunLab (sunlab.org)
Index
• Healthcare	&	Machine	Learning
• Sequence	Prediction	with	RNN
• Attention	mechanism &	interpretable	prediction
• Proposed	model:	RETAIN
• Experiments	&	results
• Conclusion
2
Healthcare	
&	
Machine	Learning
SunLab &	Healthcare
• SunLab &	Collaborators
Provider, Government, University, Company
4
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
5
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
6
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
7
[Figure: prediction setup along a time axis: observation window, diagnosis date, index date, prediction window]
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
8
[Figure: a patient timeline grouped into Visit 1, Visit 2, and Visit 3, containing codes such as Cough, Fever, Chill, Pneumonia, Chest X-ray, Tylenol, and IV fluid]
SunLab Healthcare	Projects
• Predictive	analytics	pipeline	&	Bayesian	optimization
• Patient	phenotyping
• Treatment	recommendation
• Epilepsy	patient	prediction
• Heart	failure	prediction
• Disease	progression	modeling
9
Recurrent	Neural	Network	(RNN)
Sequence	Prediction	
with	RNN
Sequence	Prediction	- NLP
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Is	the	given	sentence	positive	or	negative	?
• “Justice”	“League”	“is”	“as”	“impressive”	“as”	“a”	“preschool”	“Christmas”	“play”
• Each	word	is	a	symbol
• Outcome:	0,	1	(binary)
• The	sentence	is	either	positive	or	negative.
11
Sequence	Prediction	- EHR
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Given	a	diagnosis	history,	will	the	patient	have	heart	failure?
• Hypertension,	Hypertension,	Diabetes,	CKD,	CKD,	Diabetes,	MI
• Each	diagnosis	is	a	symbol
• Outcome:	0, 1	(binary)
• Either	you	have	HF,	or	you	don’t
12
What	is	sequence	prediction?
• Given	a	sequence	of	symbols,	predict	a	certain	outcome.
• Where	is	the	boundary	between	exons	and	introns	in	the	DNA	
sequence?
• What	is	the	French	translation	of	the	given	English	sentence?
• Given	a	diagnosis	history,	what	will	he/she	have	in	the	next	visit?
13
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
• “justice league is as impressive as a preschool christmas play”
x = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, …] (a bag-of-words vector with 1M elements, one for each word; the 1s mark the words that appear, e.g. “justice” and “preschool”)
14
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x) (transform x for an easier prediction)
15
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
Output	y
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x) (transform x for an easier prediction)
y = σ(wo^T h) (generate an outcome between 0.0 and 1.0)
16
Sequence	prediction	with	MLP
• Let’s	start	with	a	simple	Multi-layer	Perceptron	(MLP)
• Sentiment	classification	(positive	or	negative?)
Input	Layer	x
Hidden	Layer	h
Output	y
x	(a	vector	with	1M	elements.	One	for	each	word)
h = σ(Wh^T x) (transform x for an easier prediction)
y = σ(wo^T h) (generate an outcome between 0.0 and 1.0)
17
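To make the two formulas above concrete, here is a minimal NumPy sketch of the MLP classifier's forward pass. The vocabulary size, hidden size, and random weights are illustrative assumptions (the slides assume roughly 1M words); in practice the weights would be learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, hidden_size = 10_000, 128      # toy sizes; the slides assume ~1M words
W_h = 0.01 * np.random.randn(vocab_size, hidden_size)   # hidden-layer weights
w_o = np.random.randn(hidden_size)                       # output weights

def mlp_predict(x):
    """x: bag-of-words vector of length vocab_size (1 for each word that appears)."""
    h = sigmoid(W_h.T @ x)    # h = σ(Wh^T x): transform x for an easier prediction
    y = sigmoid(w_o @ h)      # y = σ(wo^T h): outcome between 0.0 and 1.0
    return y

# Toy usage: a sentence containing words 3 and 7 of the vocabulary
x = np.zeros(vocab_size)
x[[3, 7]] = 1.0
print(mlp_predict(x))
```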
Sequence	prediction	with	RNN
• Now	let’s	use	Recurrent	Neural	Network	(RNN)
• Same	sentiment	classification	(positive	or	negative?)
Hidden	Layer	h1
h1 = σ(Wi^T x1)
x1 (a vector with 1M elements; only the element for “justice” is 1)
18
x1 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …] (only the “justice” element is 1)
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
[Figure: RNN unrolled over x1 (“Justice”) and x2 (“League”), producing h1 and h2]
h2 = σ(Wh^T h1 + Wi^T x2)
19
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
[Figure: RNN unrolled over x1 … x10 (“Justice”, “League”, …, “Christmas”, “play”), producing h1 … h10]
h10 = σ(Wh^T h9 + Wi^T x10)
20
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
Output: y = σ(wo^T h10)
Outcome 0.0 ~ 1.0
21
Sequence	prediction	with	RNN
• Let’s	use	RNN	now
• Same	sentiment	classification	(positive	or	negative?)
Output: y = σ(wo^T h10)
Outcome 0.0 ~ 1.0
22
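A matching sketch of the RNN version: the same sigmoid read-out, but the hidden state is updated word by word so the order of the sentence matters. The sizes, the plain sigmoid recurrence, and the random weights are again illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, hidden_size = 10_000, 64
W_i = 0.01 * np.random.randn(vocab_size, hidden_size)   # input-to-hidden weights
W_h = 0.01 * np.random.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
w_o = np.random.randn(hidden_size)                       # output weights

def rnn_predict(xs):
    """xs: list of one-hot word vectors in sentence order."""
    h = np.zeros(hidden_size)
    for x in xs:
        h = sigmoid(W_h.T @ h + W_i.T @ x)   # h_t = σ(Wh^T h_{t-1} + Wi^T x_t)
    return sigmoid(w_o @ h)                   # y = σ(wo^T h_T): read out from the last step
```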
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
23
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
• Outcome	0.9
• Was	it	because	of	“Justice”?
• Was	it	because	of	“impressive”?
• Was	it	because	of	“Christmas”?
24
Limitation	of	RNN
• Transparency
• RNN	is	a	blackbox
• Feed	input,	receive	output
• Hard	to	tell	what	caused	the	outcome
[Figure: unrolled RNN over “Justice” … “play”; all inputs are accumulated into the final hidden state h10]
25
Attention	mechanism
&
Interpretable	Prediction
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
27
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
[Figure: RNN hidden states h1 … h10 over the sentence “Justice” … “play”]
28
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
[Figure: attention weights α1, α2, …, α9, α10 placed over the hidden states h1 … h10]
α1 + α2 + ⋯ + α10 = 1
c = α1 h1 + α2 h2 + ⋯ + α10 h10
29
Attention	models
• Bahdanau,	Cho,	Bengio,	2014
• English-French	translation	using	RNN
• Let’s	use	hidden	layers	from	all	timesteps to	make	predictions
[Figure: the context vector c, built from h1 … h10 with attention weights α1 … α10, is fed to the output layer]
y = σ(wo^T c)
30
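A minimal sketch of this attention read-out over the RNN hidden states. The scoring function that produces the α's (a single learned vector here) and the sizes are illustrative assumptions; the appendix slides describe one concrete choice (a small MLP followed by a Softmax).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

hidden_size = 64
w_a = np.random.randn(hidden_size)   # assumed scoring vector: score_t = w_a^T h_t
w_o = np.random.randn(hidden_size)   # output weights

def attention_predict(hs):
    """hs: array of shape (T, hidden_size) holding h_1 ... h_T from the RNN."""
    alphas = softmax(hs @ w_a)        # α_1 ... α_T, non-negative and summing to 1
    c = alphas @ hs                   # c = Σ_t α_t h_t: explicit combination of all steps
    y = 1.0 / (1.0 + np.exp(-(w_o @ c)))   # y = σ(wo^T c)
    return y, alphas                  # the α's tell us which word mattered most / least
```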
Attention	models
• Attention,	what	is	it	good	for?
31
Attention	models
• Attention,	what	is	it	good	for?
• c is	an	explicit	combination	of	all	past	information
• α1, α2, ⋯, α10 denote the usefulness of each word
• We can tell which word contributed the most/least to the outcome
[Figure: context vector c with attention weights α1 … α10]
32
Attention	models
• Attention,	what	is	it	good	for?
• Now c is an explicit combination of all past information
• α1, α2, ⋯, α10 denote the usefulness of each word
• We can tell which word contributed the most/least to the outcome
• The attentions α_i are generated using an MLP
[Figure: context vector c with attention weights α1 … α10]
33
Attention	Example
• English-French	translation
• Bahdanau,	Cho,	Bengio 2014
[Figure 3 from Bahdanau et al. 2014: four sample English-French alignments; each pixel shows the attention weight α_ij between a source word and a target word, in grayscale]
34
RETAIN:	Interpretable	Sequence	
Prediction	for	Healthcare	
(NIPS	2016)
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
36
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: codes laid out along a time axis: Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin]
37
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: codes laid out along a time axis: Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin]
38
Structure	of	EHR
• Assumption	so	far
• Word	sequence	=	Dx sequence
• Justice,	League,	is,	as,	impressive,	as,	…
• Cough,	Benzonatate,	Fever,	Pneumonia,	Chest	X-ray,	Amoxicillin,	...
[Figure: the same codes grouped into Visit 1, Visit 2, and Visit 3: Cough, Fever, Chill, Pneumonia, Chest X-ray, Tylenol, IV fluid]
39
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (i.e.	visit)
x1 = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, …] (first-visit vector with 40K elements, one for each medical code; the 1s mark codes such as cough, fever, tylenol, pneumonia)
40
[Figure: Visit 1 containing Cough, Fever, Tylenol, IV fluid]
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Input	Layer	x1
Embedding	Layer	v1
x1 (a	multi-hot	vector	with	40K	elements.	One	for	each	code)
v1 = tanh(Wv^T x1) (transform x1 to a compact representation)
41
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Input	Layer	x1
Embedding	Layer	v1
x1 (a	multi-hot	vector	with	40K	elements.	One	for	each	code)
v1 = tanh(Wv^T x1) (transform x1 to a compact representation)
Hidden	Layer	h1
h1 = σ(Wi^T v1)
42
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
x1
v1
Hidden	Layer	h1
x2
v2
Hidden	Layer	h2
h2 = σ(Wh^T h1 + Wi^T v2)
43
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
x1
v1
Hidden	Layer	h1
x2
v2
Hidden	Layer	h2
xT
vT
Hidden	Layer	hT
hT = σ(Wh^T hT-1 + Wi^T vT)
44
Straightforward	RNN	for	EHR
• RNN	now	accepts	multiple	medical	codes	at	each	timestep (aka	visit)
Hidden	Layer	hT
Output
y = σ(wo^T hT)
Outcome	0.0	~	1.0
45
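A minimal sketch of this straightforward visit-level RNN: multi-hot code vectors are embedded, fed through a recurrent hidden state, and read out as a risk score. The code-set size, embedding size, plain sigmoid recurrence, and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

num_codes, emb_size, hidden_size = 40_000, 128, 128
W_v = 0.01 * np.random.randn(num_codes, emb_size)    # code-to-embedding weights
W_i = 0.01 * np.random.randn(emb_size, hidden_size)  # embedding-to-hidden weights
W_h = 0.01 * np.random.randn(hidden_size, hidden_size)
w_o = np.random.randn(hidden_size)

def predict_outcome(visits):
    """visits: list of multi-hot vectors x_1 ... x_T, one per visit."""
    h = np.zeros(hidden_size)
    for x in visits:
        v = np.tanh(W_v.T @ x)                 # v_t = tanh(Wv^T x_t): compact visit embedding
        h = sigmoid(W_h.T @ h + W_i.T @ v)     # h_t = σ(Wh^T h_{t-1} + Wi^T v_t)
    return sigmoid(w_o @ h)                    # y = σ(wo^T h_T): outcome between 0.0 and 1.0
```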
RETAIN:	Motivation
• Which	visit	contributes	more	to	the	final	prediction?
[Figure: RNN over visits x1, x2, …, xT with embeddings v1 … vT and hidden layers h1 … hT]
46
RETAIN:	Motivation
• Within	a	single	visit,	which	code	contributes	more	to	the	prediction?
[Figure: visit embeddings v1 … vT and hidden layers h1 … hT; one visit's multi-hot vector x = [1, 0, 0, 1, …] with codes cough, fever, tylenol, pneumonia]
47
RETAIN:	Design	Choices
48
[Figure: standard attention model (left) vs RETAIN (right)]
RETAIN:	Design	Choices
49
[Figure: standard attention model (left) vs RETAIN (right)]
Standard attention model: an RNN embeds the visits. RETAIN: an MLP embeds the visits.
RETAIN:	Design	Choices
50
[Figure: standard attention model (left) vs RETAIN (right)]
Standard attention model: an MLP generates the attention for the visits. RETAIN: an RNN generates the attention for the visits.
RETAIN:	Design	Choices
51
[Figure: standard attention model (left) vs RETAIN (right)]
RETAIN only: another RNN generates attentions for the codes within each visit.
RETAIN:	Design	Choices
52
[Figure: standard attention model (left) vs RETAIN (right)]
In both models, the visits are combined for the prediction.
RETAIN:	Design	Choices
53
[Figure: standard attention model (left) vs RETAIN (right)]
Standard attention model: less interpretable end-to-end. RETAIN: interpretable end-to-end.
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture; given the input sequence x1, …, xi, the model predicts the label]
54
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture; given the input sequence x1, …, xi, the model predicts the label]
55
(Paper excerpt) In NMT attention, to find the j-th word in the target language, attentions α_i^j are generated for each word in the source sentence, and the context vector c_j = Σ_i α_i^j h_i is used to predict the j-th target word. In general, the attention mechanism allows the model to focus on specific words in the given sentence when generating each word in the target language. In this work, a temporal attention mechanism is defined to provide interpretability for healthcare: doctors generally pay attention to specific clinical information and specific timing when reviewing EHR data, and RETAIN mimics this practice.

2.2 Reverse Time Attention Model RETAIN
Figure 2 shows the high-level overview of the model. One key idea is to delegate a considerable portion of the prediction responsibility to the attention-weight generation process, because RNNs become hard to interpret once the recurrent weights feed past information into the hidden layer. To preserve both the visit-level and the variable-level (individual coordinates of x_i) influence, a simple linear embedding of the input vector is used:
v_i = W_emb x_i   (Step 1)
where v_i ∈ R^m denotes the embedding of the input vector x_i ∈ R^r, m the embedding dimension, and W_emb ∈ R^{m×r} (also written E) the embedding matrix to learn. A more sophisticated but still interpretable representation, such as an MLP, could also be used.
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
Two sets of attention weights are used: the scalars α1, …, αi are visit-level attention weights that govern the influence of each visit embedding v1, …, vi, and the vectors β1, …, βi are variable-level attention weights that focus on each coordinate of the visit embeddings v_{1,1}, …, v_{i,m}. Two RNNs, RNN_α and RNN_β, generate the α's and β's:
gi, gi-1, …, g1 = RNN_α(vi, vi-1, …, v1)
ej = w_α^T gj + b_α, for j = 1, …, i
α1, α2, …, αi = Softmax(e1, e2, …, ei)   (Step 2)
hi, hi-1, …, h1 = RNN_β(vi, vi-1, …, v1)
βj = tanh(W_β hj + b_β), for j = 1, …, i   (Step 3)
where gi ∈ R^p is the hidden layer of RNN_α at time step i, hi ∈ R^q the hidden layer of RNN_β, and w_α ∈ R^p, b_α ∈ R, W_β ∈ R^{m×q}, b_β ∈ R^m are the parameters to learn. The hyperparameters p and q determine the hidden layer sizes of RNN_α and RNN_β.
56
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
(Steps 2 and 3, repeated from the previous slide:)
gi, gi-1, …, g1 = RNN_α(vi, vi-1, …, v1)
ej = w_α^T gj + b_α, for j = 1, …, i
α1, α2, …, αi = Softmax(e1, e2, …, ei)   (Step 2)
hi, hi-1, …, h1 = RNN_β(vi, vi-1, …, v1)
βj = tanh(W_β hj + b_β), for j = 1, …, i   (Step 3)
57
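A hedged NumPy sketch of Steps 1 to 3, not the authors' released code: it uses plain tanh RNN cells where the paper leaves the RNN variant open, toy dimensions, and random placeholder weights (in practice all parameters are learned).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

r, m, p, q = 40_000, 128, 128, 128            # num codes, embedding size, RNN_alpha / RNN_beta sizes
W_emb = 0.01 * np.random.randn(m, r)          # embedding matrix (Step 1)
w_alpha, b_alpha = np.random.randn(p), 0.0    # parameters for the visit-level scores e_j
W_beta, b_beta = 0.01 * np.random.randn(m, q), np.zeros(m)

def reverse_rnn(V, hidden_size):
    """Stand-in for RNN_alpha / RNN_beta: a plain tanh RNN run in reverse time order.
    Weights are random placeholders for illustration; they would normally be learned."""
    W_in = 0.01 * np.random.randn(V.shape[1], hidden_size)
    W_rec = 0.01 * np.random.randn(hidden_size, hidden_size)
    h, hs = np.zeros(hidden_size), []
    for v in V[::-1]:                          # most recent visit first, as RETAIN prescribes
        h = np.tanh(W_in.T @ v + W_rec.T @ h)
        hs.append(h)
    return np.array(hs[::-1])                  # re-align so row j corresponds to visit j

def retain_attention(X):
    """X: (num_visits, r) multi-hot visit vectors. Returns embeddings, alphas, betas."""
    V = X @ W_emb.T                            # Step 1: v_j = W_emb x_j
    G = reverse_rnn(V, p)                      # hidden states g_j of RNN_alpha
    H = reverse_rnn(V, q)                      # hidden states h_j of RNN_beta
    alphas = softmax(G @ w_alpha + b_alpha)    # Step 2: scalar visit-level attention weights
    betas = np.tanh(H @ W_beta.T + b_beta)     # Step 3: vector variable-level attention weights
    return V, alphas, betas
```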
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
When doctors review past records, they typically study the patient's most recent records first and go back in time. Computationally, running the RNNs in reverse time order has advantages as well: it allows the e's and β's to change dynamically when making predictions at different time steps i = 1, 2, …, T, ensures the attention vectors differ at each time step, and makes the attention generation process more stable.
The context vector c_i for a patient up to the i-th visit is
c_i = Σ_{j=1}^{i} α_j β_j ⊙ v_j   (Step 4)
where ⊙ denotes element-wise multiplication. The context vector c_i ∈ R^m is used to predict the true label y_i ∈ {0,1}^s:
ŷ_i = Softmax(W c_i + b)   (Step 5)
where W ∈ R^{s×m} and b ∈ R^s are parameters to learn. Training minimizes the cross-entropy loss
L(x_1, …, x_T) = −(1/N) Σ_{n=1}^{N} (1/T^(n)) Σ_{i=1}^{T^(n)} ( y_i^T log(ŷ_i) + (1 − y_i)^T log(1 − ŷ_i) )   (1)
where the cross-entropy errors are summed over all dimensions of ŷ_i. For real-valued outputs y_i ∈ R^s, the cross-entropy in Eq. (1) can be replaced by, for example, mean squared error.
Overall, this attention mechanism can be viewed as the inverted architecture of the standard attention mechanism for NLP, where the words are encoded using an RNN and the attention weights are generated using an MLP.
58
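Continuing the sketch above (it reuses retain_attention, softmax, and m from the previous block), Steps 4 and 5 combine the doubly weighted visit embeddings into a context vector and turn it into a prediction; the output size s is an assumed toy value.

```python
s = 2                                         # e.g. heart failure vs no heart failure
W_out, b_out = 0.01 * np.random.randn(s, m), np.zeros(s)

def retain_predict(X):
    V, alphas, betas = retain_attention(X)
    c = (alphas[:, None] * betas * V).sum(axis=0)   # Step 4: c_i = Σ_j α_j β_j ⊙ v_j
    return softmax(W_out @ c + b_out)               # Step 5: ŷ_i = Softmax(W c_i + b)
```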
RETAIN:	Model	Architecture
[Figure 2: Unfolded view of RETAIN's architecture]
c_i = Σ_{j=1}^{i} α_j β_j ⊙ v_j   (Step 4)
ŷ_i = Softmax(W c_i + b)   (Step 5)
with the cross-entropy loss of Eq. (1) as on the previous slide. Overall, this attention mechanism can be viewed as the inverted architecture of the standard NLP attention mechanism, where words are encoded with an RNN and attention weights are generated with an MLP. RETAIN instead uses an MLP-style linear map to embed the visit information for easy interpretation, and uses RNNs to generate the two sets of attention weights, recovering the sequential information while mimicking the behavior of physicians.
59
RETAIN:	Calculating	the	Contributions
(Paper excerpt) To interpret the end-to-end behavior of RETAIN, keep the α and β values fixed (they represent the attention of doctors) and analyze how the probability of each label y_{i,1}, …, y_{i,s} changes with each original input x_{1,1}, …, x_{1,r}, …, x_{i,1}, …, x_{i,r}. The x_{j,k} that leads to the largest change in y_{i,d} is the input variable with the highest contribution. More formally, given the sequence x_1, …, x_i, the probability of the output vector y_i ∈ {0,1}^s is
p(y_i | x_1, …, x_i) = p(y_i | c_i) = Softmax(W c_i + b)   (2)
where c_i ∈ R^m denotes the context vector. By Step 4, c_i is the sum of the visit embeddings v_1, …, v_i weighted by the attentions α and β, so Eq. (2) can be rewritten as
p(y_i | x_1, …, x_i) = Softmax( W ( Σ_{j=1}^{i} α_j β_j ⊙ v_j ) + b )   (3)
Using the fact that the visit embedding v_j is the sum of the columns of W_emb weighted by the elements of x_j, Eq. (3) can be rewritten as
p(y_i | x_1, …, x_i) = Softmax( W ( Σ_{j=1}^{i} α_j β_j ⊙ Σ_{k=1}^{r} x_{j,k} W_emb[:, k] ) + b )
                    = Softmax( Σ_{j=1}^{i} Σ_{k=1}^{r} x_{j,k} α_j W ( β_j ⊙ W_emb[:, k] ) + b )   (4)
where x_{j,k} is the k-th element of the input vector x_j. Eq. (4) shows that the likelihood of y_i can be completely deconstructed down to the variables at each input x_1, …, x_i.
60
RETAIN:	Calculating	the	Contributions
(The same paper excerpt as the surrounding slides: the reverse-time rationale, Steps 4 and 5 with the cross-entropy loss (1), and Eqs. (2) to (4) decomposing the prediction probability down to the individual input variables.)
61
RETAIN:	Calculating	the	Contributions
(Eqs. (2) to (4) again: the prediction probability is decomposed down to the individual input variables x_{j,k} of every visit.)
62
From Eq. (4), the contribution ω of the k-th variable of the input x_j at time step j ≤ i to predicting y_i is
ω(y_i, x_{j,k}) = α_j W( β_j ⊙ e_{:,k} ) · x_{j,k}   (5)
where α_j W( β_j ⊙ e_{:,k} ) is the contribution coefficient and x_{j,k} is the input value. Here e_{:,k} = W_emb[:, k], the k-th column of the embedding matrix E.
RETAIN:	Calculating	the	Contributions
The highlighted term α_j W( β_j ⊙ e_{:,k} ) sits inside the iteration over k in Eq. (4): it is computed for every code k of every visit j.
63
In Eq. (4), x_{j,k} and α_j are scalars, so they can be pulled to the front of each term: x_{j,k} α_j W( β_j ⊙ W_emb[:, k] ).
RETAIN:	Calculating	the	Contributions
(Eq. (4) again, with the scalar factors x_{j,k} and α_j moved to the front of each term.)
64
Putting these together gives the contribution of the k-th variable of the input x_j at time step j ≤ i, for predicting y_i:
ω(y_i, x_{j,k}) = α_j W( β_j ⊙ W_emb[:, k] ) · x_{j,k}   (5)
RETAIN:	Calculating	the	Contributions
Eq. (5) gives the contribution of the k-th code in the j-th visit to the current prediction.
65
(The index i is omitted from α_j and β_j in Eq. (5); as described in Section 2.2, the attention weights are regenerated for each prediction time step i.)
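Continuing the same sketch (retain_attention, W_emb, W_out, r, and s come from the earlier blocks), here is a hedged illustration of Eq. (5): the contribution of every code in every visit to the current prediction, computed by fixing the learned α's and β's.

```python
def retain_contributions(X):
    """Returns omega of shape (num_visits, r, s): omega[j, k] is the contribution
    of code k in visit j to each output class, following Eq. (5)."""
    V, alphas, betas = retain_attention(X)
    omega = np.zeros((X.shape[0], r, s))
    for j in range(X.shape[0]):
        for k in np.nonzero(X[j])[0]:          # only codes present in visit j can contribute
            coeff = alphas[j] * (W_out @ (betas[j] * W_emb[:, k]))  # α_j W(β_j ⊙ W_emb[:, k])
            omega[j, k] = coeff * X[j, k]      # multiply by the input value x_{j,k}
    return omega
```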
Experiments	&	Results
Heart	Failure	(HF)	Prediction
• Objective
• Given	a	patient	record,	predict	whether	he/she	will	be	diagnosed	with	HF	in	the	
future
• 34K	patients	from	Sutter	PAMF
• 4K	cases,	30K	controls
• Use 18 months of history before the HF diagnosis
• Number of medical codes (617 in total)
• 283 diagnosis codes
• 96 medication codes
• 238 procedure codes
67
Heart	failure	prediction
• Performance	measure
• Area	under	the	ROC	curve	(AUC)
• Competing	models
• Logistic	regression
• Aggregate	all	past	codes	into	a	fixed-size	vector.	Feed	it	to	LR
• MLP
• Aggregate	all	past	codes	into	a	fixed-size	vector.	Feed	it	to	MLP
• Two-layer	RNN
• Visits	are	fed	to	the	RNN,	whose	hidden	layers	are	fed	to	another	RNN.
• RNN+attention (Bahdanau et	al.	2014)
• Visits	are	fed	to	RNN.	Visit-level	attentions	are	generated	by	MLP
• RETAIN
68
Heart	failure	prediction
Models AUC Training time	/	epoch Test	time	for	5K	patients
Logistic	Regression 0.7900	± 0.0111	 0.15s 0.11s
MLP 0.8256	± 0.0096 0.25s 0.11s
Two-layer	RNN 0.8706	± 0.0080	 10.3s 0.57s
RNN+attention 0.8624	± 0.0079 6.7s 0.48s
RETAIN 0.8705	± 0.0081 10.8s 0.63s
• RETAIN	as	accurate	as	RNN
• Requires	similar	training	time	&	test	time
• RETAIN	is	interpretable!
• RNN	is	a	blackbox
69
RETAIN	visualization
• Demo
70
Conclusion
• RETAIN:	interpretable	prediction	framework
• As	accurate	as	RNN
• Interpretable	prediction
• Predictions	can	be	explained
• Can	be	extended	to	general	prognosis
• What are the likely diseases he/she will have in the future?
• Can	be	used	for	any	sequences	with	the	two-layer	structure
• E.g.	online	shopping
71
Interpretable	Deep	Learning	
for	Healthcare
Edward	Choi	(mp2893@gatech.edu)
Jimeng Sun	(jsun@cc.gatech.edu)
SunLab (sunlab.org)
How to generate the attentions α_i?
• Use	another	neural	network	model
Input	Layer	x
Hidden	Layer	h
Output	y
x
h = σ(Wh^T x)
y = wo^T h (outcome from −∞ to +∞)
Let’s	call	this	function	y=a(x)
73
How to generate the attentions α_i?
• Use	function	a(x)	for	each	word:	Justice,	League,	…,	Christmas,	play
• Feed	the	scores	y1,	y2,	…,	y10 into	the	Softmax function
[Figure: the score function applied to each word: a(x1) = y1 for “Justice”, a(x2) = y2 for “League”, …, a(x9) = y9 for “Christmas”, a(x10) = y10 for “play”]
α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j)
74
How to generate the attentions α_i?
• Use	function	a(x)	for	each	word:	Justice,	League,	…,	Christmas,	play
• Feed	the	scores	y1,	y2,	…,	y10 into	the	Softmax function
[Figure: the score function applied to each word: a(x1) = y1 for “Justice”, a(x2) = y2 for “League”, …, a(x9) = y9 for “Christmas”, a(x10) = y10 for “play”]
α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j)
The Softmax function ensures the α_i's sum to 1
Return
75
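A minimal sketch of this attention-score generator: the small network a(x) produces an unbounded score for each word, and the Softmax turns the scores into α's that sum to 1. The two-layer form, sizes, and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

vocab_size, hidden_size = 10_000, 32
W_h = 0.01 * np.random.randn(vocab_size, hidden_size)
w_o = np.random.randn(hidden_size)

def a(x):
    """Score function a(x): maps a word vector to an unbounded scalar y."""
    h = 1.0 / (1.0 + np.exp(-(W_h.T @ x)))    # h = σ(Wh^T x)
    return w_o @ h                             # y = wo^T h, anywhere in (−∞, +∞)

def attention_weights(xs):
    """xs: list of one-hot word vectors. Returns α_1 ... α_T, which sum to 1."""
    scores = np.array([a(x) for x in xs])
    return softmax(scores)                     # α_i = exp(y_i) / Σ_j exp(y_j)
```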
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
NAVER Engineering
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
NAVER Engineering
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
NAVER Engineering
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
NAVER Engineering
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
NAVER Engineering
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
NAVER Engineering
 

More from NAVER Engineering (20)

React vac pattern
React vac patternReact vac pattern
React vac pattern
 
디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX디자인 시스템에 직방 ZUIX
디자인 시스템에 직방 ZUIX
 
진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)진화하는 디자인 시스템(걸음마 편)
진화하는 디자인 시스템(걸음마 편)
 
서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트서비스 운영을 위한 디자인시스템 프로젝트
서비스 운영을 위한 디자인시스템 프로젝트
 
BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호BPL(Banksalad Product Language) 무야호
BPL(Banksalad Product Language) 무야호
 
이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라이번 생에 디자인 시스템은 처음이라
이번 생에 디자인 시스템은 처음이라
 
날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기날고 있는 여러 비행기 넘나 들며 정비하기
날고 있는 여러 비행기 넘나 들며 정비하기
 
쏘카프레임 구축 배경과 과정
 쏘카프레임 구축 배경과 과정 쏘카프레임 구축 배경과 과정
쏘카프레임 구축 배경과 과정
 
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
플랫폼 디자이너 없이 디자인 시스템을 구축하는 프로덕트 디자이너의 우당탕탕 고통 연대기
 
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
200820 NAVER TECH CONCERT 15_Code Review is Horse(코드리뷰는 말이야)(feat.Latte)
 
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
200819 NAVER TECH CONCERT 03_화려한 코루틴이 내 앱을 감싸네! 코루틴으로 작성해보는 깔끔한 비동기 코드
 
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
200819 NAVER TECH CONCERT 10_맥북에서도 아이맥프로에서 빌드하는 것처럼 빌드 속도 빠르게 하기
 
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
200819 NAVER TECH CONCERT 08_성능을 고민하는 슬기로운 개발자 생활
 
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
200819 NAVER TECH CONCERT 05_모르면 손해보는 Android 디버깅/분석 꿀팁 대방출
 
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
200819 NAVER TECH CONCERT 09_Case.xcodeproj - 좋은 동료로 거듭나기 위한 노하우
 
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
200820 NAVER TECH CONCERT 14_야 너두 할 수 있어. 비전공자, COBOL 개발자를 거쳐 네이버에서 FE 개발하게 된...
 
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
200820 NAVER TECH CONCERT 13_네이버에서 오픈 소스 개발을 통해 성장하는 방법
 
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
200820 NAVER TECH CONCERT 12_상반기 네이버 인턴을 돌아보며
 
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
200820 NAVER TECH CONCERT 11_빠르게 성장하는 슈퍼루키로 거듭나기
 
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
200819 NAVER TECH CONCERT 07_신입 iOS 개발자 개발업무 적응기
 

Recently uploaded

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

Interpretable deep learning for healthcare

  • 23. Limitation of RNN • Transparency • The RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome 23
  • 24. Limitation of RNN • Transparency • The RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome • Outcome 0.9 • Was it because of “Justice”? • Was it because of “impressive”? • Was it because of “Christmas”? 24
  • 25. Limitation of RNN • Transparency • The RNN is a black box • Feed input, receive output • Hard to tell what caused the outcome • [Diagram: hidden states h1, h2, ..., h9, h10, one per word “Justice”, “League”, ..., “Christmas”, “play”; all inputs are accumulated in the final hidden state] 25
  • 27. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions 27
  • 28. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions • [Diagram: hidden states h1, ..., h10, one per word] 28
  • 29. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions • Attention weights: α1 + α2 + ⋯ + α10 = 1 • Context vector: c = α1 h1 + α2 h2 + ⋯ + α10 h10 29
  • 30. Attention models • Bahdanau, Cho, Bengio, 2014 • English-French translation using RNN • Let’s use hidden layers from all timesteps to make predictions • Output: y = σ(wo^T c) 30
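To make the weighted-sum idea on slides 29–30 concrete, here is a minimal sketch (not the speakers’ code) of attention pooling over the ten hidden states. It assumes h1...h10 have already been produced by an RNN; the scoring vector, sizes, and random inputs are illustrative only.

```python
import torch

torch.manual_seed(0)

# Toy setup: 10 words, RNN hidden size 64 (both are illustrative choices).
num_words, hidden_size = 10, 64
h = torch.randn(num_words, hidden_size)          # stand-ins for h_1 ... h_10 from an RNN

# Score each hidden state, then normalize with softmax so the
# attention weights alpha_1 ... alpha_10 are positive and sum to 1.
w_score = torch.randn(hidden_size)               # learnable in a real model
scores = h @ w_score                             # y_1 ... y_10
alpha = torch.softmax(scores, dim=0)             # alpha_1 + ... + alpha_10 = 1

# Context vector: an explicit weighted combination of all hidden states.
c = (alpha.unsqueeze(1) * h).sum(dim=0)          # c = sum_t alpha_t * h_t

# Final prediction y = sigmoid(w_o^T c), e.g. positive vs. negative sentiment.
w_o = torch.randn(hidden_size)
y = torch.sigmoid(w_o @ c)
print(alpha.sum().item(), y.item())
```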
  • 32. Attention models • Attention, what is it good for? • c is an explicit combination of all past information • α1, α2, ⋯, α10 denote the usefulness of each word • We can tell which word contributed the most/least to the outcome 32
  • 33. Attention models • Attention, what is it good for? • Now c is an explicit combination of all past information • α1, α2, ⋯, α10 denote the usefulness of each word • We can tell which word contributed the most/least to the outcome • The attentions αi are generated using an MLP 33
  • 34. Attention Example • English-French translation • Bahdanau, Cho, Bengio 2014 • [Figure 3 from the paper: sample alignments found by RNNsearch; rows and columns correspond to the words of the source and generated sentences, and each pixel shows the attention weight αij in grayscale (0: black, 1: white)] 34
  • 36. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... 36
  • 37. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Benzonatate Fever Pneumonia Amoxicillin Chest X-ray Time 37
  • 38. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Benzonatate Fever Pneumonia Amoxicillin Chest X-ray Time 38
  • 39. Structure of EHR • Assumption so far • Word sequence = Dx sequence • Justice, League, is, as, impressive, as, … • Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, Amoxicillin, ... Cough Visit 1 Fever Fever Visit 2 Chill Fever Visit 3 Pneumonia Chest X-ray Tylenol IV fluid 39
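A minimal illustration of the two-level structure on slides 36–39: a patient record is a sequence of visits, and each visit is an unordered set of medical codes. The patient below is hypothetical and mirrors the Visit 1 / Visit 2 / Visit 3 example.

```python
# Hypothetical patient record: a sequence of visits, each visit an unordered
# set of medical codes (diagnoses, medications, procedures).
patient = [
    ["Cough", "Fever", "Tylenol", "IV fluid"],     # Visit 1
    ["Fever", "Chill"],                             # Visit 2
    ["Fever", "Pneumonia", "Chest X-ray"],          # Visit 3
]

# A word sequence is flat; an EHR sequence has this extra visit level,
# so the model must handle a *set* of codes at every timestep.
for t, visit in enumerate(patient, start=1):
    print(f"Visit {t}: {sorted(visit)}")
```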
  • 40. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (i.e. visit) 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, … x1 (First visit vector with 40K elements. One for each medical code) cough fever tylenol pneumonia 40 Cough Visit 1 Fever Tylenol IV fluid
  • 41. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (aka visit) Input Layer x1 Embedding Layer v1 x1 (a multi-hot vector with 40K elements. One for each code) v1 = tanh(Wv^T x1) (Transform x to a compact representation) 41
  • 42. Straightforward RNN for EHR • RNN now accepts multiple medical codes at each timestep (aka visit) Input Layer x1 Embedding Layer v1 x1 (a multi-hot vector with 40K elements. One for each code) v1 = tanh(Wv^T x1) (Transform x to a compact representation) Hidden Layer h1 h1 = σ(Wi^T v1) 42
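A sketch of the “straightforward RNN” on slides 40–42, using a toy vocabulary instead of 40K codes: each visit becomes a multi-hot vector x_t, which is embedded as v_t = tanh(Wv^T x_t) and fed to a recurrent layer. The GRU, the sizes, and the final prediction head are illustrative choices, not the speakers’ exact configuration.

```python
import torch
import torch.nn as nn

vocab = ["Cough", "Fever", "Chill", "Pneumonia", "Chest X-ray", "Tylenol", "IV fluid"]
code_index = {c: i for i, c in enumerate(vocab)}

patient = [
    ["Cough", "Fever", "Tylenol", "IV fluid"],   # Visit 1
    ["Fever", "Chill"],                           # Visit 2
    ["Fever", "Pneumonia", "Chest X-ray"],        # Visit 3
]

# Multi-hot visit vectors x_t: one element per medical code, several 1s per visit.
x = torch.zeros(len(patient), len(vocab))
for t, visit in enumerate(patient):
    for code in visit:
        x[t, code_index[code]] = 1.0

emb_dim, hid_dim = 8, 16                          # illustrative sizes
W_v = nn.Linear(len(vocab), emb_dim, bias=False)  # visit embedding: v_t = tanh(W_v x_t)
rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
out = nn.Linear(hid_dim, 1)

v = torch.tanh(W_v(x)).unsqueeze(0)               # (1, num_visits, emb_dim)
h, _ = rnn(v)                                     # hidden states h_1 ... h_T
y = torch.sigmoid(out(h[:, -1]))                  # predict (e.g. HF) from the last hidden state
print(y.item())
```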
  • 48. RETAIN: Design Choices • [Diagram comparing the standard attention model with RETAIN: input x, visit embedding v, attention weights, element-wise multiplication ⊙, and context vector c] 48
  • 49. RETAIN: Design Choices • Standard attention model: an RNN embeds the visits • RETAIN: an MLP embeds the visits 49
  • 50. RETAIN: Design Choices • Standard attention model: an MLP generates the attentions for the visits • RETAIN: an RNN generates the attentions for the visits 50
  • 51. RETAIN: Design Choices • RETAIN: another RNN generates the attentions for the codes within each visit 51
  • 52. RETAIN: Design Choices • In both models, the attended visits are combined for prediction 52
  • 53. RETAIN: Design Choices • Standard attention model: less interpretable end-to-end • RETAIN: interpretable end-to-end 53
  • 54. RETAIN: Model Architecture • [Figure 2: Unfolded view of RETAIN’s architecture. Given the input sequence x1, ..., xi, the model embeds each visit, runs two RNNs (RNNα and RNNβ) over the embeddings in reverse time order to produce the attention weights, and combines the attended visit embeddings into a context vector for prediction.] 54
  • 55. RETAIN: Model Architecture • Attention lets the model focus on specific words (or visits) when making a prediction; RETAIN defines a temporal attention mechanism that mimics how doctors review EHR data, paying attention to specific clinical information at specific times. • One key idea is to delegate much of the prediction responsibility to the attention-generation process, so the visit embedding itself stays simple and interpretable. • Step 1 (visit embedding): vi = Wemb xi, where xi ∈ R^r is the multi-hot visit vector, vi ∈ R^m its embedding, and Wemb ∈ R^(m×r) the embedding matrix to learn (a more sophisticated but still interpretable embedding such as an MLP could also be used). 55
  • 56. RETAIN: Model Architecture • Two sets of attention weights are used: the scalars α1, ..., αi (visit-level) govern the influence of each visit embedding v1, ..., vi, and the vectors β1, ..., βi (variable-level) focus on the individual coordinates of each visit embedding. • Step 2: gi, gi−1, ..., g1 = RNNα(vi, vi−1, ..., v1); ej = wα^T gj + bα for j = 1, ..., i; α1, α2, ..., αi = Softmax(e1, e2, ..., ei). 56
  • 57. RETAIN: Model Architecture • Step 3: hi, hi−1, ..., h1 = RNNβ(vi, vi−1, ..., v1); βj = tanh(Wβ hj + bβ) for j = 1, ..., i. • Here gi ∈ R^p and hi ∈ R^q are the hidden layers of RNNα and RNNβ at time step i; wα ∈ R^p, bα ∈ R, Wβ ∈ R^(m×q) and bβ ∈ R^m are parameters to learn, and the hyperparameters p and q set the hidden-layer sizes of RNNα and RNNβ. 57
  • 58. RETAIN: Model Architecture • Both RNNs run in reverse time order: doctors typically study the most recent records first and then go back in time, and computationally the reverse order lets the attention values change dynamically across prediction time steps and keeps the attention generation stable. • Step 4 (context vector): ci = Σ_{j=1..i} αj βj ⊙ vj, where ⊙ denotes element-wise multiplication. 58
  • 59. RETAIN: Model Architecture • Step 5 (prediction): ŷi = Softmax(W ci + b), with W ∈ R^(s×m) and b ∈ R^s parameters to learn; training minimizes the cross-entropy loss L(x1, ..., xT) = −(1/N) Σ_n (1/T^(n)) Σ_i [ yi^T log(ŷi) + (1 − yi)^T log(1 − ŷi) ] (for real-valued outputs, mean squared error can be used instead). • Overall, this is the inverted architecture of the standard NLP attention mechanism: an MLP embeds the visits to preserve interpretation, and RNNs generate the two sets of attention weights. 59
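The five steps above can be sketched directly. The following is a simplified single-patient forward pass, not the authors’ released implementation: RNNα and RNNβ are GRUs run over the reversed visit sequence, the prediction is made only at the final visit, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class RetainSketch(nn.Module):
    """Simplified RETAIN forward pass for one patient (Steps 1-5 above)."""

    def __init__(self, num_codes, emb_dim=16, alpha_dim=16, beta_dim=16, num_labels=2):
        super().__init__()
        self.emb = nn.Linear(num_codes, emb_dim, bias=False)   # Step 1: v_j = W_emb x_j
        self.rnn_alpha = nn.GRU(emb_dim, alpha_dim, batch_first=True)
        self.rnn_beta = nn.GRU(emb_dim, beta_dim, batch_first=True)
        self.w_alpha = nn.Linear(alpha_dim, 1)                  # e_j = w_a^T g_j + b_a
        self.w_beta = nn.Linear(beta_dim, emb_dim)              # beta_j = tanh(W_b h_j + b_b)
        self.out = nn.Linear(emb_dim, num_labels)               # Step 5

    def forward(self, x):                                       # x: (num_visits, num_codes)
        v = self.emb(x).unsqueeze(0)                            # (1, T, emb_dim)
        v_rev = torch.flip(v, dims=[1])                         # reverse time order
        g, _ = self.rnn_alpha(v_rev)
        h, _ = self.rnn_beta(v_rev)
        g = torch.flip(g, dims=[1])                             # back to forward order
        h = torch.flip(h, dims=[1])
        alpha = torch.softmax(self.w_alpha(g), dim=1)           # Step 2: visit-level weights
        beta = torch.tanh(self.w_beta(h))                       # Step 3: variable-level weights
        c = (alpha * beta * v).sum(dim=1)                       # Step 4: context vector
        return torch.softmax(self.out(c), dim=-1)               # Step 5: prediction

x = (torch.rand(5, 100) < 0.05).float()                         # toy data: 5 visits, 100 codes
print(RetainSketch(num_codes=100)(x))
```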
  • 60. RETAIN: Calculating the Contributions e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the 60
  • 61. RETAIN: Calculating the Contributions the past records, they typically study the patient’s most recent records fi117 Computationally, running the RNN in reversed time order has several advan118 time order allows us to generate e’s and ’s that dynamically change th119 predictions at different time steps i = 1, 2, . . . , T. It ensures that the attentio120 at each timestamp and makes the attention generation process computation121 We generate the context vector ci for a patient up to the i-th visit as follow122 ci = iX j=1 ↵j j vj, where denotes element-wise multiplication. We use the context vector ci123 label yi 2 {0, 1}s as follows,124 byi = Softmax(Wci + b), where W 2 Rs⇥m and b 2 Rs are parameters to learn. We use the cross125 classification loss as follows,126 L(x1, . . . , xT ) = 1 N NX n=1 1 T(n) T (n) X i=1 ⇣ y> i log(byi) + (1 yi)> where we sum the cross entropy errors from all dimensions of byi. In ca127 yi 2 Rs , we can change the cross-entropy in Eq. (1) to for example mean128 Overall, our attention mechanism can be viewed as the inverted architecture129 mechanism for NLP [2] where the words are encoded using RNN and gene130 e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the 61 n terms of the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the argest change in yi,d will be the input variable with highest contribution. More formally, given the equence x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which an be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) where ci 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings 1, . . . , vi weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) Using the fact that the visit embedding vi is the sum of the columns of Wemb weighted by each lement of xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) where xj,k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the kelihood of yi can be completely deconstructed down to the variables at each input x1, . . . , xi. herefore we can calculate the contribution ! 
of the k-th variable of the input xj at time step j  i,
  • 62. RETAIN: Calculating the Contributions n terms of the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the argest change in yi,d will be the input variable with highest contribution. More formally, given the equence x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which an be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) where ci 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings 1, . . . , vi weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) Using the fact that the visit embedding vi is the sum of the columns of Wemb weighted by each lement of xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) where xj,k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the kelihood of yi can be completely deconstructed down to the variables at each input x1, . . . , xi. herefore we can calculate the contribution ! of the k-th variable of the input xj at time step j  i, e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the 62 2.2 Reverse Time Attention Model RETAIN Figure 2 shows the high-level overview of our model. One key idea is the prediction responsibility to the attention weights generation proces due to the recurrent weights feeding past information to the hidden l visit-level and the variable-level (individual coordinates of xi) influen input vector xi. That is, we define vi = Exi, where vi 2 Rm denotes the embedding of the input vector xi 2 Rr , m E 2 Rm⇥r the embedding matrix to learn. We can easily choose a mor representation such as multilayer perceptron (MLP) [13, 28] which has in EHR data [10]. We use two sets of weights for the visit-level attention and the vari scalars ↵1, . . . , ↵i are the visit-level attention weights that govern th v1, . . . , vi. The vectors 1, . . . , i are the variable-level attention weig the visit embedding v1,1, v1,2, . . . , v1,m, . . . , vi,1, vi,2, . . . , vi,m. We use two RNNs, RNN↵ and RNN , to separately generate ↵’s a predict the probability of the output vector yi 2 {0, 1}s , which can be expressed as follows p(yi|x1, . . 
. , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of E weighted by each element of xi, be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,ke:,k ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j e:,k ⌘ + b ◆ (4) is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the likelihood of completely deconstructed down to the variables at each input x1, . . . , xi. Therefore we can calculate bution ! of the k-th variable of the input xj at time step j  i, for predicting yi as follows, !(yi, xj,k) = ↵jW( j e:,k) | {z } Contribution coefficient xj,k |{z} Input value , (5) k-th column of E
  • 63. RETAIN: Calculating the Contributions e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the Inside the iteration over k 63 e in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the largest change in e the input variable with highest contribution. More formally, given the sequence x1, . . . , xi, we are predict the probability of the output vector yi 2 {0, 1}s , which can be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of E weighted by each element of xi, n be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,ke:,k ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j e:,k ⌘ + b ◆ (4) is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the likelihood of completely deconstructed down to the variables at each input x1, . . . , xi. Therefore we can calculate bution ! of the k-th variable of the input xj at time step j  i, for predicting yi as follows, !(y , x ) = ↵ W( e ) x , (5) n terms of the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the argest change in yi,d will be the input variable with highest contribution. More formally, given the equence x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which an be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) where ci 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings 1, . . . , vi weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) Using the fact that the visit embedding vi is the sum of the columns of Wemb weighted by each lement of xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) where xj,k is the k-th element of the input vector xj. 
Eq (4) tells us that the calculation of the kelihood of yi can be completely deconstructed down to the variables at each input x1, . . . , xi. herefore we can calculate the contribution ! of the k-th variable of the input xj at time step j  i, predict the probability of the output vector yi 2 {0, 1}s , which can be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of E weighted by each element of xi, be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,ke:,k ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j e:,k ⌘ + b ◆ (4) is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the likelihood of completely deconstructed down to the variables at each input x1, . . . , xi. Therefore we can calculate bution ! of the k-th variable of the input xj at time step j  i, for predicting yi as follows, !(yi, xj,k) = ↵jW( j e:,k) | {z } Contribution coefficient xj,k |{z} Input value , (5) Scalars in the front
  • 64. RETAIN: Calculating the Contributions e a method to interpret the end-to-end behavior of RETAIN. By keeping ↵ and values fixed ntion of doctors, we will analyze the changes in the probability of each label yi,1, . . . , yi,s f the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the ange in yi,d will be the input variable with highest contribution. More formally, given the x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which pressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of Wemb weighted by each f xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the 64 1 i predict the probability of the output vector yi 2 {0, 1}s , which can be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) fact that the visit embedding vi is the sum of the columns of E weighted by each element of xi, n be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,ke:,k ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j e:,k ⌘ + b ◆ (4) is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the likelihood of completely deconstructed down to the variables at each input x1, . . . , xi. Therefore we can calculate bution ! of the k-th variable of the input xj at time step j  i, for predicting yi as follows, !(yi, xj,k) = ↵jW( j e:,k) | {z } Contribution coefficient xj,k |{z} Input value , (5) n terms of the change in an original input x1,1, . . . , x1,r, . . . , xi,1, . . . , xi,r. The xj,k that lead to the argest change in yi,d will be the input variable with highest contribution. More formally, given the equence x1, . . . , xi, we are trying to predict the probability of the output vector yi 2 {0, 1}s , which an be expressed as follows p(yi|x1, . . . , xi) = p(yi|ci) = Softmax (Wci + b) (2) where ci 2 Rm denotes the context vector. According to Step 4, ci is the sum of the visit embeddings 1, . . . , vi weighted by the attentions ↵’s and ’s. Therefore Eq (2) can be rewritten as follows, p(yi|x1, . . . , xi) = p(yi|ci) = Softmax ✓ W ⇣ iX j=1 ↵j j vj ⌘ + b ◆ (3) Using the fact that the visit embedding vi is the sum of the columns of Wemb weighted by each lement of xi, Eq (3) can be rewritten as follows, p(yi|x1, . . . , xi) = Softmax ✓ W ⇣ iX j=1 ↵j j rX k=1 xj,kWemb[:, k] ⌘ + b ◆ = Softmax ✓ iX j=1 rX k=1 xj,k ↵jW ⇣ j Wemb[:, k] ⌘ + b ◆ (4) where xj,k is the k-th element of the input vector xj. Eq (4) tells us that the calculation of the kelihood of yi can be completely deconstructed down to the variables at each input x1, . . . , xi. herefore we can calculate the contribution ! 
The contribution of the k-th variable of the input x_j at time step j ≤ i is derived as follows. Given the sequence x_1, …, x_i, we predict the probability of the output vector y_i ∈ {0, 1}^s, which can be expressed as

p(y_i | x_1, …, x_i) = p(y_i | c_i) = Softmax(W c_i + b)   (2)

where c_i ∈ R^m denotes the context vector. According to Step 4, c_i is the sum of the visit embeddings v_1, …, v_i weighted by the attentions α's and β's, so Eq (2) can be rewritten as

p(y_i | x_1, …, x_i) = p(y_i | c_i) = Softmax( W (Σ_{j=1}^{i} α_j β_j ⊙ v_j) + b )   (3)

Using the fact that the visit embedding v_j is the sum of the columns of W_emb weighted by the elements of x_j, Eq (3) can be rewritten as

p(y_i | x_1, …, x_i) = Softmax( W (Σ_{j=1}^{i} α_j β_j ⊙ Σ_{k=1}^{r} x_{j,k} W_emb[:, k]) + b )
                    = Softmax( Σ_{j=1}^{i} Σ_{k=1}^{r} x_{j,k} α_j W(β_j ⊙ W_emb[:, k]) + b )   (4)

where x_{j,k} is the k-th element of the input vector x_j. Eq (4) tells us that the likelihood of y_i can be completely deconstructed down to the individual variables of each input x_1, …, x_i. Therefore the contribution ω of the k-th variable of the input x_j at time step j ≤ i for predicting y_i is

ω(y_i, x_{j,k}) = α_j W(β_j ⊙ W_emb[:, k]) × x_{j,k}   (5)

where α_j W(β_j ⊙ W_emb[:, k]) is the contribution coefficient and x_{j,k} is the input value (the index i is omitted from α_j and β_j for brevity).
  • 65. RETAIN: Calculating the Contributions • This derivation gives a method to interpret the end-to-end behavior of RETAIN. Keeping the α and β values fixed as the attention of the doctor, we analyze how the probability of each label y_{i,1}, …, y_{i,s} changes with a change in the original input x_{1,1}, …, x_{1,r}, …, x_{i,1}, …, x_{i,r}; the x_{j,k} that leads to the largest change in y_{i,d} is the input variable with the highest contribution. • The term ω(y_i, x_{j,k}) in Eq (5) is exactly the contribution of the k-th code in the j-th visit (a small NumPy sketch of this computation follows below). 65
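The contribution formula in Eq (5) can be evaluated directly from the learned parameters. Below is a minimal NumPy sketch of that computation, assuming the attentions α and β, the embedding matrix W_emb, and the output weights W have already been obtained from a trained RETAIN model; the function name and array shapes are illustrative, not part of the original slides.

```python
import numpy as np

def retain_contributions(x, alpha, beta, W_emb, W):
    """Per-variable contributions omega(y_i, x_{j,k}) from Eq (5).

    x      : (i, r)  multi-hot inputs for visits 1..i
    alpha  : (i,)    visit-level attention weights
    beta   : (i, m)  variable-level attention weights
    W_emb  : (m, r)  embedding matrix (column k embeds medical code k)
    W      : (s, m)  output projection
    returns: (i, r, s) contribution of code k in visit j to each label
    """
    i, r = x.shape
    s = W.shape[0]
    omega = np.zeros((i, r, s))
    for j in range(i):
        for k in range(r):
            if x[j, k] == 0:                                   # absent codes contribute nothing
                continue
            coeff = alpha[j] * (W @ (beta[j] * W_emb[:, k]))   # contribution coefficient
            omega[j, k] = coeff * x[j, k]                      # times the input value
    return omega
```

As a sanity check, summing ω over all visits and codes, adding the bias b, and applying Softmax should reproduce the prediction of Eq (4).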
  • 67. Heart Failure (HF) Prediction • Objective • Given a patient record, predict whether he/she will be diagnosed with HF in the future • 34K patients from Sutter PAMF • 4K cases, 30K controls • Use the 18-month history before being diagnosed with HF (one way to build this window is sketched below) • Number of medical codes • 283 diagnosis codes • 96 medication codes • 238 procedure codes • 617 medical codes in total 67
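The slide only states that the 18-month history before the HF diagnosis is used (controls would need an analogous index date, which the slide does not specify). The pandas sketch below shows one plausible way to carve out such an observation window; all column names (patient_id, visit_date, code, index_date) are hypothetical.

```python
import pandas as pd

def build_observation_window(visits: pd.DataFrame, index_dates: pd.DataFrame,
                             months: int = 18) -> pd.DataFrame:
    """Keep only the codes that fall inside the `months`-long window before each
    patient's index date.

    visits      : one row per (patient_id, visit_date, code)
    index_dates : one row per (patient_id, index_date), e.g. the HF diagnosis date
    """
    df = visits.merge(index_dates, on="patient_id")            # attach the index date
    window_start = df["index_date"] - pd.DateOffset(months=months)
    in_window = (df["visit_date"] >= window_start) & (df["visit_date"] < df["index_date"])
    return df.loc[in_window, ["patient_id", "visit_date", "code"]]
```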
  • 68. Heart failure prediction • Performance measure • Area under the ROC curve (AUC) • Competing models • Logistic regression • Aggregate all past codes into a fixed-size vector and feed it to LR (a sketch of this baseline follows below) • MLP • Aggregate all past codes into a fixed-size vector and feed it to the MLP • Two-layer RNN • Visits are fed to an RNN, whose hidden states are fed to another RNN • RNN+attention (Bahdanau et al. 2014) • Visits are fed to an RNN; visit-level attentions are generated by an MLP • RETAIN 68
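For reference, here is a minimal sketch of the simplest baseline above: all codes from a patient's past visits are collapsed into one fixed-size count vector (617 dimensions for the cohort on the previous slide) and fed to scikit-learn's LogisticRegression. The toy patient data and helper name are placeholders, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aggregate_codes(patient_visits, num_codes=617):
    """Collapse a patient's visit sequence into one fixed-size count vector.

    patient_visits: list of visits, each visit a list of integer code indices.
    """
    x = np.zeros(num_codes)
    for visit in patient_visits:
        for code in visit:
            x[code] += 1
    return x

# Toy placeholders: two patients, label 1 = HF case, 0 = control
patients = [[[0, 5], [5, 12]], [[3], [3, 7, 20]]]
labels = [1, 0]

X = np.stack([aggregate_codes(p) for p in patients])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1])   # predicted HF risk per patient
```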
  • 69. Heart failure prediction

Models              | AUC             | Training time / epoch | Test time for 5K patients
Logistic Regression | 0.7900 ± 0.0111 | 0.15s                 | 0.11s
MLP                 | 0.8256 ± 0.0096 | 0.25s                 | 0.11s
Two-layer RNN       | 0.8706 ± 0.0080 | 10.3s                 | 0.57s
RNN+attention       | 0.8624 ± 0.0079 | 6.7s                  | 0.48s
RETAIN              | 0.8705 ± 0.0081 | 10.8s                 | 0.63s

• RETAIN is as accurate as the RNN • Requires similar training time & test time • RETAIN is interpretable! • RNN is a black box 69
  • 71. Conclusion • RETAIN: an interpretable prediction framework • As accurate as an RNN • Interpretable prediction • Predictions can be explained • Can be extended to general prognosis • What are the likely diseases he/she will have in the future? • Can be used for any sequence with the two-layer structure • E.g., online shopping 71
  • 74. How to generate the attentions α_i? • Use a scoring function a(x) for each word: Justice, League, …, Christmas, play • Feed the scores y_1, y_2, …, y_10 into the Softmax function • [Figure: each word x_1 (Justice), x_2 (League), …, x_9 (Christmas), x_10 (play) is passed through a(x) to produce its score y_1, …, y_10] • α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j) 74
  • 75. How to generate the attentions α_i? • Use a scoring function a(x) for each word: Justice, League, …, Christmas, play • Feed the scores y_1, y_2, …, y_10 into the Softmax function • [Figure: same diagram as the previous slide] • α_i = exp(y_i) / Σ_{j=1}^{10} exp(y_j) • The Softmax function ensures the α_i's sum to 1 (a small sketch follows below) Return 75
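A minimal NumPy sketch of this attention computation is below. The slides do not specify the form of a(x); the dot product between a word embedding and a learned scoring vector used here is only one common, hypothetical choice, and the embeddings are random stand-ins.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax, so the attention weights sum to 1."""
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(10, 8))   # x_1 ... x_10 ("Justice" ... "play")
u = rng.normal(size=8)                       # stand-in for the learned scoring vector in a(x)

scores = word_embeddings @ u                 # y_1 ... y_10
alpha = softmax(scores)                      # attention weights alpha_1 ... alpha_10
print(alpha, alpha.sum())                    # the weights sum to 1
```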