BERTology Meets Biology:
Interpreting Attention in
Protein Language Models
https://arxiv.org/abs/2006.15222
https://github.com/salesforce/provis
BERT: masked language model
https://jalammar.github.io/illustrated-bert/
BERTology
There is a lot of research analyzing BERT representations.
One of the recent works is:
A Primer in BERTology: What we know about how BERT works
https://arxiv.org/abs/2002.12327
Idea
Analyze the inner workings of the Transformer through the lens of attention, and
explore how the model discerns structural and functional properties of proteins.
Data
Pretrained BERT-Base from the TAPE repository.
The authors DID NOT TRAIN THEIR OWN MODEL.
Evaluating Protein Transfer
Learning with TAPE
https://arxiv.org/abs/1906.08230
https://github.com/songlab-cal/tape
TAPE: Tasks Assessing Protein Embeddings
A set of five biologically relevant supervised tasks that evaluate the
performance of learned protein embeddings across diverse aspects of protein
understanding (self-supervised pretraining + supervised training):
● Secondary Structure Prediction (Structure prediction task)
● Contact Prediction (Structure prediction task)
● Remote Homology Detection (Evolutionary Understanding task)
● Fluorescence Landscape Prediction (Protein Engineering task)
● Stability Landscape Prediction (Protein Engineering task)
Pre-training Data
Pre-training corpus:
● Pfam: 31M protein domains.
○ Test set: fully held-out families (1% of the data)
○ Remaining data split 95%/5%
Pre-training Data
Data
Tasks
Language modeling (predict next token):
Masked language modeling (predict masked token):
https://blog.einstein.ai/provis/
Losses
Self-supervised losses:
● Next-token prediction + a reverse model (as in a standard language model, run in both directions)
● Masked-token prediction (like in BERT; see the sketch after this list)
Protein-specific loss in a separate experiment:
● Contact prediction + remote homology detection
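
A minimal PyTorch sketch of the masked-token objective (BERT-style): mask a fraction of residues and compute cross-entropy only at the masked positions. The vocabulary size, mask rate, and random logits below are illustrative stand-ins, not TAPE's actual tokenizer or model.

import torch
import torch.nn.functional as F

# Illustrative amino-acid vocabulary: 20 residues plus <pad> and <mask> (not TAPE's real vocab).
VOCAB_SIZE, PAD_ID, MASK_ID = 22, 0, 21

def masked_lm_loss(logits, tokens, mask_positions):
    """Cross-entropy on masked positions only, as in BERT-style pretraining."""
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len); mask_positions: bool (batch, seq_len)
    return F.cross_entropy(logits[mask_positions], tokens[mask_positions])

# Toy usage: hide ~15% of residues and score a stand-in model output.
tokens = torch.randint(1, 21, (4, 100))            # 4 sequences of length 100
mask_positions = torch.rand(tokens.shape) < 0.15   # which residues are hidden
corrupted = tokens.masked_fill(mask_positions, MASK_ID)
logits = torch.randn(4, 100, VOCAB_SIZE)           # stand-in for model(corrupted)
print(masked_lm_loss(logits, tokens, mask_positions))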
Models
Models:
● LSTM (next-token pred.)
● Transformer encoder 12L-512D-8A, 38M params (masked-token pred.)
● Dilated ResNet (masked-token pred.)
Pretrained models
Currently available pretrained models are:
● bert-base (Transformer encoder model,
https://github.com/songlab-cal/tape/issues/39)
(this model is used in the BERTology/Biology paper)
● babbler-1900 (UniRep model)
● xaa, xab, xac, xad, xae (trRosetta model)
Results
Summary on Data/Model
BERTology/Biology paper uses:
● Transformer encoder (BERT-like 12L-512D-8(12?)A, 38M params)
● Trained on Pfam data
● Using Masked-token prediction (like MLM task in BERT)
Back to BERTology/Biology
“We use the BERT-Base model from the TAPE repository, which was pretrained on
masked language modeling of amino acids over a dataset of 31 million protein
sequences”
“The BERT-Base model has 12 layers and 12 heads, yielding a total of 144 distinct
attention mechanisms”
(mismatch: the TAPE paper reports only 8 heads)
The core of the analysis
Analysis
Data
Using two datasets from TAPE for the analysis (not training):
● ProteinNet dataset (task 2, contact maps).
Used for amino acids analysis & contact maps
● Secondary Structure dataset (task 1).
Used for analysis of secondary structure & binding sites
Plus token-level binding site annotations from the PDB.
“For analyzing attention, we used a random subset of 5000 sequences from the
respective training splits as this analysis was purely evaluative; for training the
probing classifier, we used the full training splits for training the model, and the
validation splits for evaluation.”
Data
A CATCH!
Attention is not an explanation!
Attention is not Explanation, https://arxiv.org/abs/1902.10186
“ In this work, we perform extensive experiments across a variety of NLP tasks that
aim to assess the degree to which attention weights provide meaningful
`explanations' for predictions. We find that they largely do not. For example,
learned attention weights are frequently uncorrelated with gradient-based
measures of feature importance, and one can identify very different attention
distributions that nonetheless yield equivalent predictions. Our findings show that
standard attention modules do not provide meaningful explanations and should
not be treated as though they do.”
Attention is not an explanation!
Why Attention is Not Explanation: Surgical Intervention and Causal
Reasoning about Neural Models,
https://www.aclweb.org/anthology/2020.lrec-1.220/
“ From this analysis, we assert the impossibility of causal explanations from
attention layers over text data. We then introduce NLP researchers to
contemporary philosophy of science theories that allow robust yet non-causal
reasoning in explanation, giving computer scientists a vocabulary for future
research.”
Attention is not an explanation!
Learning to Deceive with Attention-Based Explanations,
https://arxiv.org/abs/1909.07913
“We call the latter use of attention mechanisms into question by demonstrating a
simple method for training models to produce deceptive attention masks.
Our method diminishes the total weight assigned to designated impermissible
tokens, even when the models can be shown to nevertheless rely on these features
to drive predictions. ... Consequently, our results cast doubt on attention's
reliability as a tool for auditing algorithms in the context of fairness and
accountability.”
Yet...
Attention is not not Explanation, https://arxiv.org/abs/1908.04626
“A recent paper claims that `Attention is not Explanation' (Jain and Wallace, 2019).
... We propose four alternative tests to determine when/whether attention can be
used as explanation: a simple uniform-weights baseline; a variance calibration
based on multiple random seed runs; a diagnostic framework using frozen weights
from pretrained models; and an end-to-end adversarial attention training
protocol. Each allows for meaningful interpretation of attention mechanisms in
RNN models. We show that even when reliable adversarial distributions can be
found, they don't perform well on the simple diagnostic, indicating that prior
work does not disprove the usefulness of attention mechanisms for
explainability.”
FINDINGS
Attention heads specialize in certain types of aa
“We computed the proportion of attention that each head focuses on particular
types of amino acids, averaged over a dataset of 5000 sequences with a combined
length of 1,067,712 amino acids.
We found that for 14 of the 20 types of amino acids, there exists an attention
head that focuses over 25% of attention on that amino acid.
For example, Figure 2 shows that head 1-11 focuses 78% of its total attention on
the amino acid Pro and head 12-3 focuses 27% of attention on Phe.
Note that the maximum frequency of any single type of amino acid in the
dataset is 9.4%.”
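
A minimal numpy sketch of the statistic quoted above: the share of each head's attention that lands on one amino-acid type, shown here for a single sequence with random stand-in attention. The paper averages this over 5000 sequences; the function and variable names are illustrative, not the authors' code.

import numpy as np

def attention_to_amino_acid(attn, sequence, target_aa):
    """Proportion of each head's attention that lands on one residue type.

    attn:      (layers, heads, seq_len, seq_len) attention weights for one sequence
    sequence:  string of amino-acid codes, length seq_len
    target_aa: single-letter code, e.g. "P" for proline
    Returns an array of shape (layers, heads).
    """
    is_target = np.array([aa == target_aa for aa in sequence])   # (seq_len,) bool
    to_target = attn[..., is_target].sum(axis=(-2, -1))          # attention received by target residues
    return to_target / attn.sum(axis=(-2, -1))

# Toy usage with random attention over a short sequence.
seq = "MPRPFKLAPK"
attn = np.random.rand(12, 12, len(seq), len(seq))
attn /= attn.sum(axis=-1, keepdims=True)              # rows sum to 1, like a softmax
print(attention_to_amino_acid(attn, seq, "P").shape)  # -> (12, 12)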
Attention heads specialize in certain types of aa
Attention similarity vs. BLOSUM
“We analyze how the attention received by amino acids relates to an existing
measure of structural and functional properties: the substitution matrix. We assess
whether attention tracks similar properties by computing the similarity of attention
between each pair of amino acids and then comparing this metric to the pairwise
similarity based on the substitution matrix. To measure attention similarity, we
compute the Pearson correlation between the proportion of attention that
each amino acid receives across heads.
For example, to measure the attention similarity between Pro and Phe, we take the
Pearson correlation of the two heatmaps in Figure 2. The values of all such
pairwise correlations are shown in Figure 3a. We compare these scores to the
BLOSUM scores in Figure 3b, and find a Pearson correlation of 0.80, suggesting
that attention is largely consistent with substitution relationships.”
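
A minimal sketch of the attention-similarity computation described above: the Pearson correlation between the per-head attention profiles of two amino acids (Figure-2-style heatmaps, flattened). The random profiles below are stand-ins; the paper then correlates the resulting pairwise scores with BLOSUM.

import numpy as np
from scipy.stats import pearsonr

def attention_similarity(profiles):
    """Pairwise Pearson correlation between per-head attention profiles.

    profiles: dict mapping amino acid -> flattened (layers*heads,) vector of the
              proportion of each head's attention received by that residue.
    Returns the list of amino acids and an (n_aa, n_aa) correlation matrix.
    """
    aas = sorted(profiles)
    sim = np.zeros((len(aas), len(aas)))
    for i, a in enumerate(aas):
        for j, b in enumerate(aas):
            sim[i, j] = pearsonr(profiles[a], profiles[b])[0]
    return aas, sim

# Toy usage with random profiles for three residues (12 layers x 12 heads = 144 heads).
profiles = {aa: np.random.rand(144) for aa in "FGP"}
aas, sim = attention_similarity(profiles)
print(aas, sim.round(2))
# The paper then compares these pairwise scores to the BLOSUM substitution
# matrix and reports a Pearson correlation of 0.80.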
Attention similarity vs. BLOSUM
Attention aligns strongly with contact maps
“Figure 4 shows the percentage of each head’s attention that aligns with contact
maps.
A single head, 12-4, aligns much more strongly with contact maps (28% of
attention) than any of the other heads (maximum 7% of attention). In cases where
the attention weight in head 12-4 is greater than 0.9, the alignment increases to
76%.
In contrast, the frequency of contact pairs among all token pairs in the dataset is
1.3%.”
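
A minimal sketch of the alignment statistic: the share of each head's attention that falls on residue pairs in contact, for one sequence with stand-in attention and contacts. The paper aggregates this over the ProteinNet data; names and thresholds here are illustrative.

import numpy as np

def attention_contact_alignment(attn, contact_map):
    """Share of each head's attention that falls on residue pairs in contact.

    attn:        (layers, heads, seq_len, seq_len) attention for one sequence
    contact_map: (seq_len, seq_len) boolean matrix of ground-truth contacts
    Returns an array of shape (layers, heads).
    """
    on_contacts = (attn * contact_map).sum(axis=(-2, -1))
    return on_contacts / attn.sum(axis=(-2, -1))

# Toy usage with random attention and ~1% contact density.
L = 64
attn = np.random.rand(12, 12, L, L)
attn /= attn.sum(axis=-1, keepdims=True)
contacts = np.random.rand(L, L) < 0.013             # cf. 1.3% contact frequency in the data
print(attention_contact_alignment(attn, contacts).max())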
Attention aligns strongly with contact maps
Example protein and the induced attn from h12-4
Binding sites
“Figure 6 shows the proportion of
attention focused on binding sites
by each head.
In most layers, the mean percentage
across heads is significantly higher
than the background frequency of
binding sites (4.8%).
The effect is strongest in the last 6
layers of the model, which include
15 heads that each focus over 20%
of their attention on binding sites.”
“Head 7-1 focuses the most attention
on binding sites (34%).”
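
A minimal sketch of the binding-site statistic behind Figure 6: per-head attention to binding-site tokens, averaged over heads within each layer, compared against the background frequency of binding sites. The random annotations below are stand-ins for the PDB-derived labels.

import numpy as np

def binding_site_attention_by_layer(attn, is_binding_site):
    """Mean (over heads) share of attention directed at binding-site tokens, per layer.

    attn:            (layers, heads, seq_len, seq_len) attention for one sequence
    is_binding_site: (seq_len,) boolean token-level annotation
    Returns an array of shape (layers,).
    """
    to_sites = attn[..., is_binding_site].sum(axis=(-2, -1))
    per_head = to_sites / attn.sum(axis=(-2, -1))    # (layers, heads)
    return per_head.mean(axis=-1)

# Toy usage: compare against the background frequency of binding sites (~4.8% in the paper).
L = 80
attn = np.random.rand(12, 12, L, L)
attn /= attn.sum(axis=-1, keepdims=True)
sites = np.random.rand(L) < 0.05
print(binding_site_attention_by_layer(attn, sites).round(3))
print("background frequency:", sites.mean())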
Attention targets binding sites
“Figure 7 shows the estimated
probability of head 7-1 targeting a
binding site, as a function of the
attention weight.
We also find that tokens often target
binding sites from far away in the
sequence. In Head 7-1, for example,
the average distance spanned by
attention to binding sites is 124
tokens.”
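
One simple way to estimate a curve like Figure 7 is to bin attention weights and take the empirical frequency of binding-site targets within each bin. The binning estimator and all names below are an assumption for illustration, not necessarily the estimator the authors used.

import numpy as np

def binding_site_prob_by_weight(attn_head, is_binding_site, bins=10):
    """Empirical P(target residue is a binding site | attention weight), by binning.

    attn_head:       (seq_len, seq_len) attention of a single head (e.g. head 7-1)
    is_binding_site: (seq_len,) boolean annotation of the attended-to (key) residues
    Returns (bin_edges, per-bin probability).
    """
    weights = attn_head.ravel()
    is_site = np.broadcast_to(is_binding_site, attn_head.shape).ravel()
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(weights, edges) - 1, 0, bins - 1)
    prob = np.array([is_site[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(bins)])
    return edges, prob

# Toy usage with stand-in attention and annotations.
L = 60
attn = np.random.rand(L, L)
attn /= attn.sum(axis=-1, keepdims=True)
sites = np.random.rand(L) < 0.05
print(binding_site_prob_by_weight(attn, sites, bins=5)[1])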
Attention targets higher-level props in deeper layers
“As shown in Figure 8, deeper layers
focus relatively more attention on
binding sites and contacts (high-level
concept), whereas secondary structure
(low- to mid-level concept) is targeted
more evenly across layers.”
“The probing analysis (Figure 9) similarly shows
that the model first forms representations of
secondary structure before fully encoding
contact maps and binding sites. ”
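
A minimal sketch of a layerwise probe in the spirit of Figure 9: fit a simple classifier on frozen per-token representations from each layer and see where a concept becomes linearly decodable. The logistic-regression probe, stand-in hidden states, and labels are assumptions for illustration; the paper's probing setup differs in detail.

import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_layer(hidden_states, labels):
    """Fit a linear probe on frozen per-token embeddings from one layer.

    hidden_states: (n_tokens, hidden_dim) representations from a single layer
    labels:        (n_tokens,) token-level annotations (e.g. binding site yes/no)
    Returns the probe's training accuracy (a proper setup would use a held-out split).
    """
    clf = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
    return clf.score(hidden_states, labels)

# Toy usage: probe every layer of a 12-layer, 512-dim model with stand-in data.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(500, 512)) for _ in range(12)]
labels = rng.integers(0, 2, 500)
print([round(probe_layer(h, labels), 2) for h in layers])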
Discussions
https://ru.linkedin.com/in/grigorysapunov
grigory.sapunov@ieee.org
Thanks!
