Cost-Effective Interactive Attention Learning with Neural Attention Processes
Jay Heo1, Junhyeon Park1, Hyewon Jeong1, Kwang Joon Kim2, Juho Lee3, Eunho Yang1,3, Sung Ju Hwang1,3
KAIST1, Yonsei University College of Medicine2, AITRICS3
Model Interpretability
The complex nature of deep neural networks has led to a recent surge of interest in interpretable models that provide model interpretations.
[Figure: Input Data → Main Network → Inference; an interpretation tool produces a Model Interpretation alongside the inference.]
• Interpretations provide explanations for the model's decisions.
Challenge: Incorrect & Unreliable Interpretation
Not all machine-generated interpretations are correct or human-understandable.
• The correctness and reliability of a learning model depend heavily on the quality and quantity of the training data.
• Neural networks tend to learn non-robust features that help with predictions but are not human-perceptible.
[Figure: A model interpretation raises three questions — 1. Is it correct? 2. Is it understandable enough to trust? 3. Did the model learn too many non-robust features during training?]
Interactive Learning Framework
We propose an interactive learning framework that iteratively updates the model by interacting with human supervisors who adjust the provided interpretations.
• Actively use human supervisors as a channel for human-model communication.
[Figure: An attentional network delivers its decision and interpretation (attention weights 0.3/0.6/0.8 with low or high uncertainty) to a human annotator such as a physician, who annotates the attentions; the model is then retrained.]
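To make the loop concrete, here is a minimal Python sketch of the interaction rounds; every name in it (the model API, `select_batch`, `expert_annotate`) is a hypothetical placeholder rather than the paper's actual interface:

```python
# Minimal sketch of the interactive attention learning loop; all components
# (model API, selection and annotation functions) are hypothetical stand-ins.
def interactive_attention_learning(model, data, select_batch, expert_annotate,
                                   num_rounds=4, budget=100):
    """Iteratively collect expert feedback on attentions and update the model."""
    annotations = []                                 # (input, attention, mask) triples
    for s in range(num_rounds):
        attn = model.attend(data["train_x"])         # model-generated interpretations
        batch = select_batch(model, data, attn, k=budget)   # cost-effective reranking
        masks = expert_annotate(batch, attn[batch])  # expert marks attentions in {0, 1}
        annotations.extend(zip(data["train_x"][batch], attn[batch], masks))
        model.update_with_annotations(annotations)   # with NAP: a forward pass, no retraining
    return model
```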
Challenge: Model Retraining Cost
To reflect human feedback, the model needs to be retrained, which is costly.
• Retraining the model on scarce human feedback may cause it to overfit.
[Figure: A physician annotates a few examples; retraining the learning model on this scarce feedback leads to overfitting.]
Challenge: Expensive Human Supervision Cost
Obtaining human feedback on datasets with large numbers of training instances and features is extremely costly.
• Obtaining feedback on already correct or previously corrected interpretations is wasteful.
[Figure: An annotator labels attention masks $M^{(t)}_{\text{attention}} \in \{0,1\}$ over big data; annotating everything is costly.]
Interactive Attention Learning Framework
Domain experts interactively evaluate the learned attentions and provide feedback to obtain models that generate human-intuitive interpretations.
[Figure: The learning model's attention mechanism delivers attentions over clinical features (LDL, respiration, cholesterol, creatinine, BMI, diabetes, heart failure, hypertension) to a physician, who returns a binary annotation mask $M^{(t)}_{\text{attention}} \in \{0,1\}$. A deep interpretation tool — (a) influence function, (b) MC dropout, (c) counterfactual estimation — together with correlation and causal-relationship analysis (Granger causality) helps the expert explain and manipulate the attentions.]
The framework has two key components:
1. Neural Attention Processes
2. Cost-effective Reranking
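As a toy illustration of what the annotation mask encodes, the snippet below gates generated attention with a binary mask; note that in the framework itself the masks supervise the attention generator (NAP) rather than multiplying the attention directly, and the shapes here are made up:

```python
import torch

# Illustrative only: gating model attention with an expert's binary mask.
# In the paper the masks supervise the attention generator instead.
N, D = 4, 8                                      # instances, features (made-up sizes)
attention = torch.rand(N, D)                     # model-generated attention weights
mask = torch.randint(0, 2, (N, D)).float()       # expert annotations in {0, 1}

corrected = attention * mask                     # suppress features the expert rejected
corrected = corrected / corrected.sum(1, keepdim=True).clamp_min(1e-8)  # renormalize
```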
Neural Attention Processes (NAP)
NAP naturally reflects the information from the annotation summarization z via amortized inference.
• NAP learns to summarize the delivered annotations into a latent vector, and feeds the summarization as an additional input to the attention-generating network.
[Figure: Domain expert → annotation → Neural Attention Processes.]
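A minimal PyTorch sketch of this idea follows, assuming per-feature attention and invented layer sizes; `NAPSketch` is an illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class NAPSketch(nn.Module):
    """Illustrative Neural Attention Process: summarize annotated context
    points into a latent vector z and condition attention generation on z.
    Layer sizes and the aggregation choice are assumptions."""
    def __init__(self, x_dim, hidden=64):
        super().__init__()
        # Encoder over context triples (input x, attention a, expert mask m).
        self.encoder = nn.Sequential(
            nn.Linear(3 * x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        # Attention generator g conditioned on the summary z.
        self.generator = nn.Sequential(
            nn.Linear(x_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim), nn.Sigmoid())

    def forward(self, context_x, context_a, context_m, target_x):
        # Permutation-invariant summary of the annotated context set.
        r = self.encoder(torch.cat([context_x, context_a, context_m], -1))
        z = r.mean(dim=0, keepdim=True)            # amortized summary z
        z = z.expand(target_x.size(0), -1)
        return self.generator(torch.cat([target_x, z], -1))
```

Because new annotations only change the summary z, incorporating them costs one forward pass through the encoder rather than any gradient update.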
Neural Attention Processes (NAP)
NAP minimizes retraining cost by incorporating newly labeled instances without retraining or overfitting.
[Figure: First round (s=1) — annotated context points and new observations enter the attention-generating network.]
• NAP does not require retraining for further new observations, in that it automatically adapts to them at the cost of a single forward pass through a network $g$.
Neural Attention Processes (NAP)
NAP minimizes retraining cost by incorporating newly labeled instances without retraining or overfitting.
• NAP is trained in a meta-learning fashion for few-shot function estimation: it learns to predict the attention masks of other labeled samples, given a randomly selected labeled set as context.
[Figure: Further rounds (s=2, 3, 4) — previously annotated points serve as the context for new observations.]
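A sketch of one such meta-training episode, reusing the hypothetical `NAPSketch` above; the context size and the loss are assumptions:

```python
import torch

# Sketch of one meta-training episode for NAP (see NAPSketch above).
def nap_episode(nap, x, a, m, n_context=10):
    """Predict the attention masks of target points given a random labeled context."""
    perm = torch.randperm(x.size(0))
    ctx, tgt = perm[:n_context], perm[n_context:]       # random context/target split
    pred = nap(x[ctx], a[ctx], m[ctx], x[tgt])
    # Binary cross-entropy against the expert masks (floats in {0, 1}) of the targets.
    return torch.nn.functional.binary_cross_entropy(pred, m[tgt])
```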
Cost-Effective Instance & Feature Reranking (CER)
CER addresses the expensive human labeling cost by reranking the instances, features, and timesteps (for time-series data) by their negative impacts.
[Figure: Instance-wise and feature-wise reranking sits between the attentional network and the domain expert.]
• Instance-level reranking: estimate $I(u_i)$ or $\mathrm{Var}(u_i)$ over K train/validation splits, then re-rank and select the top P instances.
• Feature-level reranking: for the selected instances, estimate $I(u^{(t)}_{i,j})$, $\mathrm{Var}(u^{(t)}_{i,j})$, or $\psi(u^{(t)}_{i,j})$, then re-rank and select among the F features.
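The overall selection procedure might look like the following sketch, where `score_instance` and `score_feature` stand in for the influence, uncertainty, or counterfactual scores described next, and P and F are annotation budgets:

```python
import numpy as np

# Sketch of cost-effective reranking: score instances, keep the top-P,
# then score and keep the top-F features of each kept instance.
def rerank_for_annotation(score_instance, score_feature, X, P=100, F=5):
    inst_scores = np.array([score_instance(x) for x in X])
    top_instances = np.argsort(-inst_scores)[:P]     # highest negative impact first
    picked = {}
    for i in top_instances:
        feat_scores = score_feature(X[i])            # one score per feature
        picked[i] = np.argsort(-feat_scores)[:F]     # features worth annotating
    return picked                                    # instance -> feature indices
```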
CER: 1. Influence Score
Use the influence function (Koh & Liang, 2017) to approximate the impact of individual training points on the model's prediction.
[Figure: Training images labeled fish, dog, dog influence the prediction "Dog" on a test input.]
[Koh and Liang, 2017] Understanding Black-box Predictions via Influence Functions, ICML 2017.
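For a tiny model with an explicit parameter vector, the influence of a training point on a test loss can be sketched directly from the Koh & Liang formula $I(z, z_{\text{test}}) = -\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z)$; the damping constant and full Hessian inversion are illustrative simplifications (realistic models need approximations such as LiSSA):

```python
import torch

# Influence-function sketch for a tiny logistic regression with parameter
# vector w (w must have requires_grad=True).
def loss(w, x, y):
    return torch.nn.functional.binary_cross_entropy_with_logits(x @ w, y)

def influence_on_test(w, X_train, y_train, x_tr, y_tr, x_te, y_te):
    # Full Hessian of the training loss; feasible only for very small models.
    H = torch.autograd.functional.hessian(lambda v: loss(v, X_train, y_train), w)
    H = H + 1e-3 * torch.eye(w.numel())              # damping for invertibility
    g_tr = torch.autograd.grad(loss(w, x_tr, y_tr), w)[0]
    g_te = torch.autograd.grad(loss(w, x_te, y_te), w)[0]
    return -(g_te @ torch.linalg.solve(H, g_tr))     # scalar influence score
```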
CER: 2. Uncertainty Score
Measure the negative impacts using predictive uncertainty, which can be estimated by Monte-Carlo sampling (Gal & Ghahramani, 2016).
• A less expensive approach to measuring the negative impacts.
• Assumes that instances with high predictive uncertainty are potential candidates for correction.
[Figure: Uncertainty-aware attention measures instance-wise and feature-wise uncertainty N(μ, σ) over features such as SpO2, pulse, and respiration; attention values 0.3/0.6/0.8 carry low or high uncertainty.]
[Jay Heo*, Hae Beom Lee*, Saehoon Kim, Juho Lee, Kwang Joon Kim, Eunho Yang, Sung Ju Hwang] Uncertainty-Aware Attention for Reliable Interpretation and Prediction, NeurIPS 2018.
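A minimal sketch of MC-dropout uncertainty estimation, assuming a `model` that contains dropout layers; this is the generic Gal & Ghahramani recipe, not the paper's exact instance- and feature-wise estimator:

```python
import torch

# MC-dropout sketch (Gal & Ghahramani, 2016): keep dropout active at
# inference and use the variance across stochastic forward passes.
def mc_dropout_uncertainty(model, x, n_samples=30):
    model.train()                        # keep dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.var(0)   # predictive mean and uncertainty
```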
CER: 3. Counterfactual Score
How would the prediction change if we ignored a certain feature by manually turning its attention value on or off?
• No retraining is needed, since we can simply set the attention value to zero.
• Used to rerank the features with regard to their importance.
[Figure: Counterfactual estimation interface.]
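A small sketch of this score, assuming a hypothetical `model(x, attention)` that accepts externally supplied attention weights:

```python
import torch

# Counterfactual score sketch: zero out one feature's attention and measure
# how much the prediction changes; no retraining is involved.
def counterfactual_scores(model, x, attention):
    with torch.no_grad():
        base = model(x, attention)
        scores = []
        for j in range(attention.size(-1)):
            attn_cf = attention.clone()
            attn_cf[..., j] = 0.0                  # turn feature j's attention off
            scores.append((model(x, attn_cf) - base).abs().sum().item())
    return scores                                  # larger = more important feature
```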
Experimental Setting – Datasets
We use electronic health records, real-estate sales transaction records, and exercise squat-posture correction records for classification and regression tasks.
1. EHR datasets (binary classification): 1) Cerebral Infarction, 2) Cardiovascular Disease, 3) Heart Failure
2. Real-estate dataset (regression): 1) Housing price forecasting
3. Squat posture dataset (multi-label classification): 1) Squat posture correction
Attention Evaluation Interface
Domain experts evaluate the delivered attentions α and β via the attention annotation masks $M_t^{\alpha} \in \{0,1\}^{N \times T}$ and $M_t^{\beta} \in \{0,1\}^{N \times D \times T}$.
• Risk prediction: attention annotation platform for the risk prediction task with counterfactual estimation.
• Real estate in NYC: attention annotation platform for real-estate price forecasting in New York City.
• Action posture correction: attention annotation platform for the action posture correction task.
Experiment Results
We conducted experiments on three risk prediction tasks, one fitness squat task, and one real-estate forecasting task.

| Setting | Model | EHR: Heart Failure | EHR: Cerebral Infarction | EHR: CVD | Fitness Squat | Real Estate Forecasting |
|---|---|---|---|---|---|---|
| One-time Training | RETAIN | 0.6069 ± 0.01 | 0.6394 ± 0.02 | 0.6018 ± 0.02 | 0.8425 ± 0.03 | 0.2136 ± 0.01 |
| One-time Training | Random-RETAIN | 0.5952 ± 0.02 | 0.6256 ± 0.02 | 0.5885 ± 0.01 | 0.8221 ± 0.05 | 0.2140 ± 0.01 |
| One-time Training | IF-RETAIN | 0.6134 ± 0.03 | 0.6422 ± 0.02 | 0.5882 ± 0.02 | 0.8363 ± 0.03 | 0.2049 ± 0.01 |
| Random Re-ranking | Random-UA | 0.6231 ± 0.03 | 0.6491 ± 0.01 | 0.6112 ± 0.02 | 0.8521 ± 0.02 | 0.2222 ± 0.02 |
| Random Re-ranking | Random-NAP | 0.6414 ± 0.01 | 0.6674 ± 0.02 | 0.6284 ± 0.01 | 0.8525 ± 0.01 | 0.2061 ± 0.01 |
| IAL (Cost-effective) | AILA | 0.6363 ± 0.03 | 0.6602 ± 0.03 | 0.6193 ± 0.02 | 0.8425 ± 0.01 | 0.2119 ± 0.01 |
| IAL (Cost-effective) | IAL-NAP | 0.6612 ± 0.02 | 0.6892 ± 0.03 | 0.6371 ± 0.02 | 0.8689 ± 0.01 | 0.1835 ± 0.01 |

• Random-UA, which is retrained with human attention-level supervision on randomly selected samples, performs worse than Random-NAP.
• IAL-NAP significantly outperforms Random-NAP, showing that attention annotation has little effect on the model when the instances are selected at random.
Experiment Results
Ablation study with the proposed IAL-NAP combinations for instance- and feature-level reranking on all tasks.

| Instance-level | Feature-level | EHR: Heart Failure | EHR: Cerebral Infarction | EHR: CVD | Fitness Squat | Real Estate Forecasting |
|---|---|---|---|---|---|---|
| Influence Function | Uncertainty | 0.6563 ± 0.01 | 0.6821 ± 0.02 | 0.6308 ± 0.02 | 0.8712 ± 0.01 | 0.1921 ± 0.01 |
| Influence Function | Influence Function | 0.6514 ± 0.02 | 0.6825 ± 0.01 | 0.6329 ± 0.03 | 0.8632 ± 0.01 | 0.1865 ± 0.02 |
| Influence Function | Counterfactual | 0.6592 ± 0.02 | 0.6921 ± 0.03 | 0.6379 ± 0.02 | 0.8682 ± 0.01 | 0.1863 ± 0.02 |
| Uncertainty | Counterfactual | 0.6612 ± 0.01 | 0.6892 ± 0.03 | 0.6371 ± 0.02 | 0.8689 ± 0.02 | 0.1835 ± 0.02 |

• For instance-level scoring, the influence and uncertainty scores work similarly, while the counterfactual score was the most effective for feature-wise reranking.
• The uncertainty-counterfactual combination is the most cost-effective solution, since it avoids the expensive computation of Hessians.
Effect of Neural Attention Processes
Retraining time on human-annotated examples and mean response time of human annotations on the risk prediction tasks.
[Figure: (a) Heart Failure, (b) Cerebral Infarction, (c) CVD]
Effect of Cost-Effective Reranking
Change in accuracy with 100 annotations across four rounds (S) for IAL-NAP (blue) vs. Random-NAP (red).
[Figure: (a) Heart Failure, (b) Cerebral Infarction, (c) CVD, (d) Squat]
• IAL-NAP uses fewer annotated examples (100) than Random-NAP (400) to improve the model to comparable accuracy (AUC 0.6414).
Qualitative Analysis – Risk Prediction
We further analyze the contribution of each feature for a CVD patient (label = 1).
[Figure: A patient's records for cardiovascular disease — feature attentions at (a) pretrained, (b) s=1, (c) s=2. Features: Age; Smoking (whether the patient smokes); SysBP (systolic blood pressure); HDL (high-density lipoprotein); LDL (low-density lipoprotein).]
• At s=3, IAL allocated more attention weight to the important feature (Smoking), which the initially trained model failed to attend to.
→ Clinicians guided the model to learn it, since smoking is a key factor in assessing CVD.
Summary
We propose a novel interactive learning framework that iteratively updates the model by interacting with the human supervisor via the generated attentions.
• Unlike conventional active learning, IAL allows the human annotators to "actively" interpret and manipulate the model's behavior and see the effect. IAL allows for online learning without retraining the main network, by training a "separate" attention generator.
• Neural Attention Processes is a novel attention mechanism that can generate attention on unlabeled instances given a few labeled samples, and can incorporate new labeled instances without retraining or overfitting.
• Our reranking strategy re-ranks the instances and features, which substantially reduces the annotation cost and time for high-dimensional inputs.
Thanks