BERTology Meets Biology:
Interpreting Attention in
Protein Language Models
https://arxiv.org/abs/2006.15222
https://github.com/salesforce/provis
BERT: masked language model
https://jalammar.github.io/illustrated-bert/
BERTology
There is a lot of research analyzing BERT representations.
One of the recent works is:
A Primer in BERTology: What we know about how BERT works
https://arxiv.org/abs/2002.12327
Idea
Analyze the inner workings of the Transformer through the lens of attention, and
explore how the model discerns structural and functional properties of proteins.
Data
Pretrained BERT-Base from the TAPE repository.
The authors DID NOT TRAIN THEIR OWN MODEL.
Evaluating Protein Transfer
Learning with TAPE
https://arxiv.org/abs/1906.08230
https://github.com/songlab-cal/tape
TAPE: Tasks Assessing Protein Embeddings
A set of five biologically relevant supervised tasks that evaluate the
performance of learned protein embeddings across diverse aspects of protein
understanding (self-supervised pretraining + supervised training):
● Secondary Structure Prediction (Structure prediction task)
● Contact Prediction (Structure prediction task)
● Remote Homology Detection (Evolutionary Understanding task)
● Fluorescence Landscape Prediction (Protein Engineering task)
● Stability Landscape Prediction (Protein Engineering task)
Pre-training Data
Pre-training corpus:
● Pfam: 31M protein domains.
○ Test set: fully held-out families (1% of the data)
○ Remaining data split 95%/5%
Pre-training Data
Data
Tasks
Language modeling (predict next token):
Masked language modeling (predict masked token):
https://blog.einstein.ai/provis/
Losses
Self-supervised losses:
● Next-token prediction + a reverse model (as in a standard language model, run in both directions)
● Masked-token prediction (like in BERT; see the sketch after this list)
Protein-specific loss in a separate experiment:
● Contact prediction + remote homology detection
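
A minimal PyTorch sketch of the masked-token objective (BERT-style): mask a fraction of residues and compute cross-entropy only at the masked positions. The vocabulary size, mask rate, and random logits below are illustrative stand-ins, not TAPE's actual tokenizer or model.

import torch
import torch.nn.functional as F

# Illustrative amino-acid vocabulary: 20 residues plus <pad> and <mask> (not TAPE's real vocab).
VOCAB_SIZE, PAD_ID, MASK_ID = 22, 0, 21

def masked_lm_loss(logits, tokens, mask_positions):
    """Cross-entropy on masked positions only, as in BERT-style pretraining."""
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len); mask_positions: bool (batch, seq_len)
    return F.cross_entropy(logits[mask_positions], tokens[mask_positions])

# Toy usage: hide ~15% of residues and score a stand-in model output.
tokens = torch.randint(1, 21, (4, 100))            # 4 sequences of length 100
mask_positions = torch.rand(tokens.shape) < 0.15   # which residues are hidden
corrupted = tokens.masked_fill(mask_positions, MASK_ID)
logits = torch.randn(4, 100, VOCAB_SIZE)           # stand-in for model(corrupted)
print(masked_lm_loss(logits, tokens, mask_positions))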
Models
Models:
● LSTM (next-token pred.)
● Transformer encoder 12L-512D-8A, 38M params (masked-token pred.)
● Dilated ResNet (masked-token pred.)
Pretrained models
Currently available pretrained models are:
● bert-base (Transformer encoder model,
https://github.com/songlab-cal/tape/issues/39)
(this model is used in the BERTology/Biology paper)
● babbler-1900 (UniRep model)
● xaa, xab, xac, xad, xae (trRosetta model)
Results
Summary on Data/Model
BERTology/Biology paper uses:
● Transformer encoder (BERT-like 12L-512D-8(12?)A, 38M params)
● Trained on Pfam data
● Using Masked-token prediction (like MLM task in BERT)
Back to BERTology/Biology
“We use the BERT-Base model from the TAPE repository, which was pretrained on
masked language modeling of amino acids over a dataset of 31 million protein
sequences”
“The BERT-Base model has 12 layers and 12 heads, yielding a total of 144 distinct
attention mechanisms”
(mismatch: the TAPE paper reports only 8 heads)
The core of the analysis
Analysis
Data
Using two datasets from TAPE for the analysis (not training):
● ProteinNet dataset (task 2, contact maps).
Used for amino acids analysis & contact maps
● Secondary Structure dataset (task 1).
Used for analysis of secondary structure & binding sites
Plus token-level binding site annotations from the PDB.
“For analyzing attention, we used a random subset of 5000 sequences from the
respective training splits as this analysis was purely evaluative; for training the
probing classifier, we used the full training splits for training the model, and the
validation splits for evaluation.”
Data
A CATCH!
Attention is not an explanation!
Attention is not Explanation, https://arxiv.org/abs/1902.10186
“ In this work, we perform extensive experiments across a variety of NLP tasks that
aim to assess the degree to which attention weights provide meaningful
`explanations' for predictions. We find that they largely do not. For example,
learned attention weights are frequently uncorrelated with gradient-based
measures of feature importance, and one can identify very different attention
distributions that nonetheless yield equivalent predictions. Our findings show that
standard attention modules do not provide meaningful explanations and should
not be treated as though they do.”
Attention is not an explanation!
Why Attention is Not Explanation: Surgical Intervention and Causal
Reasoning about Neural Models,
https://www.aclweb.org/anthology/2020.lrec-1.220/
“ From this analysis, we assert the impossibility of causal explanations from
attention layers over text data. We then introduce NLP researchers to
contemporary philosophy of science theories that allow robust yet non-causal
reasoning in explanation, giving computer scientists a vocabulary for future
research.”
Attention is not an explanation!
Learning to Deceive with Attention-Based Explanations,
https://arxiv.org/abs/1909.07913
“We call the latter use of attention mechanisms into question by demonstrating a
simple method for training models to produce deceptive attention masks.
Our method diminishes the total weight assigned to designated impermissible
tokens, even when the models can be shown to nevertheless rely on these features
to drive predictions. ... Consequently, our results cast doubt on attention's
reliability as a tool for auditing algorithms in the context of fairness and
accountability.”
Yet...
Attention is not not Explanation, https://arxiv.org/abs/1908.04626
“A recent paper claims that `Attention is not Explanation' (Jain and Wallace, 2019).
... We propose four alternative tests to determine when/whether attention can be
used as explanation: a simple uniform-weights baseline; a variance calibration
based on multiple random seed runs; a diagnostic framework using frozen weights
from pretrained models; and an end-to-end adversarial attention training
protocol. Each allows for meaningful interpretation of attention mechanisms in
RNN models. We show that even when reliable adversarial distributions can be
found, they don't perform well on the simple diagnostic, indicating that prior
work does not disprove the usefulness of attention mechanisms for
explainability.”
FINDINGS
Attention heads specialize in certain types of aa
“We computed the proportion of attention that each head focuses on particular
types of amino acids, averaged over a dataset of 5000 sequences with a combined
length of 1,067,712 amino acids.
We found that for 14 of the 20 types of amino acids, there exists an attention
head that focuses over 25% of attention on that amino acid.
For example, Figure 2 shows that head 1-11 focuses 78% of its total attention on
the amino acid Pro and head 12-3 focuses 27% of attention on Phe.
Note that the maximum frequency of any single type of amino acid in the
dataset is 9.4%.”
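
A minimal numpy sketch of the statistic quoted above: the share of each head's attention that lands on one amino-acid type, shown here for a single sequence with random stand-in attention. The paper averages this over 5000 sequences; the function and variable names are illustrative, not the authors' code.

import numpy as np

def attention_to_amino_acid(attn, sequence, target_aa):
    """Proportion of each head's attention that lands on one residue type.

    attn:      (layers, heads, seq_len, seq_len) attention weights for one sequence
    sequence:  string of amino-acid codes, length seq_len
    target_aa: single-letter code, e.g. "P" for proline
    Returns an array of shape (layers, heads).
    """
    is_target = np.array([aa == target_aa for aa in sequence])   # (seq_len,) bool
    to_target = attn[..., is_target].sum(axis=(-2, -1))          # attention received by target residues
    return to_target / attn.sum(axis=(-2, -1))

# Toy usage with random attention over a short sequence.
seq = "MPRPFKLAPK"
attn = np.random.rand(12, 12, len(seq), len(seq))
attn /= attn.sum(axis=-1, keepdims=True)              # rows sum to 1, like a softmax
print(attention_to_amino_acid(attn, seq, "P").shape)  # -> (12, 12)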
Attention heads specialize in certain types of aa
Attention similarity vs. BLOSUM
“We analyze how the attention received by amino acids relates to an existing
measure of structural and functional properties: the substitution matrix. We assess
whether attention tracks similar properties by computing the similarity of attention
between each pair of amino acids and then comparing this metric to the pairwise
similarity based on the substitution matrix. To measure attention similarity, we
compute the Pearson correlation between the proportion of attention that
each amino acid receives across heads.
For example, to measure the attention similarity between Pro and Phe, we take the
Pearson correlation of the two heatmaps in Figure 2. The values of all such
pairwise correlations are shown in Figure 3a. We compare these scores to the
BLOSUM scores in Figure 3b, and find a Pearson correlation of 0.80, suggesting
that attention is largely consistent with substitution relationships.”
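
A minimal sketch of the attention-similarity computation described above: the Pearson correlation between the per-head attention profiles of two amino acids (Figure-2-style heatmaps, flattened). The random profiles below are stand-ins; the paper then correlates the resulting pairwise scores with BLOSUM.

import numpy as np
from scipy.stats import pearsonr

def attention_similarity(profiles):
    """Pairwise Pearson correlation between per-head attention profiles.

    profiles: dict mapping amino acid -> flattened (layers*heads,) vector of the
              proportion of each head's attention received by that residue.
    Returns the list of amino acids and an (n_aa, n_aa) correlation matrix.
    """
    aas = sorted(profiles)
    sim = np.zeros((len(aas), len(aas)))
    for i, a in enumerate(aas):
        for j, b in enumerate(aas):
            sim[i, j] = pearsonr(profiles[a], profiles[b])[0]
    return aas, sim

# Toy usage with random profiles for three residues (12 layers x 12 heads = 144 heads).
profiles = {aa: np.random.rand(144) for aa in "FGP"}
aas, sim = attention_similarity(profiles)
print(aas, sim.round(2))
# The paper then compares these pairwise scores to the BLOSUM substitution
# matrix and reports a Pearson correlation of 0.80.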
Attention similarity vs. BLOSUM
Attention aligns strongly with contact maps
“Figure 4 shows the percentage of each head’s attention that aligns with contact
maps.
A single head, 12-4, aligns much more strongly with contact maps (28% of
attention) than any of the other heads (maximum 7% of attention). In cases where
the attention weight in head 12-4 is greater than 0.9, the alignment increases to
76%.
In contrast, the frequency of contact pairs among all token pairs in the dataset is
1.3%.”
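
A minimal sketch of the alignment statistic: the share of each head's attention that falls on residue pairs in contact, for one sequence with stand-in attention and contacts. The paper aggregates this over the ProteinNet data; names and thresholds here are illustrative.

import numpy as np

def attention_contact_alignment(attn, contact_map):
    """Share of each head's attention that falls on residue pairs in contact.

    attn:        (layers, heads, seq_len, seq_len) attention for one sequence
    contact_map: (seq_len, seq_len) boolean matrix of ground-truth contacts
    Returns an array of shape (layers, heads).
    """
    on_contacts = (attn * contact_map).sum(axis=(-2, -1))
    return on_contacts / attn.sum(axis=(-2, -1))

# Toy usage with random attention and ~1% contact density.
L = 64
attn = np.random.rand(12, 12, L, L)
attn /= attn.sum(axis=-1, keepdims=True)
contacts = np.random.rand(L, L) < 0.013             # cf. 1.3% contact frequency in the data
print(attention_contact_alignment(attn, contacts).max())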
Attention aligns strongly with contact maps
Example protein and the induced attn from h12-4
Binding sites
“Figure 6 shows the proportion of
attention focused on binding sites
by each head.
In most layers, the mean percentage
across heads is significantly higher
than the background frequency of
binding sites (4.8%).
The effect is strongest in the last 6
layers of the model, which include
15 heads that each focus over 20%
of their attention on binding sites.”
“Head 7-1 focuses the most attention
on binding sites (34%).”
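
A minimal sketch of the binding-site statistic behind Figure 6: per-head attention to binding-site tokens, averaged over heads within each layer, compared against the background frequency of binding sites. The random annotations below are stand-ins for the PDB-derived labels.

import numpy as np

def binding_site_attention_by_layer(attn, is_binding_site):
    """Mean (over heads) share of attention directed at binding-site tokens, per layer.

    attn:            (layers, heads, seq_len, seq_len) attention for one sequence
    is_binding_site: (seq_len,) boolean token-level annotation
    Returns an array of shape (layers,).
    """
    to_sites = attn[..., is_binding_site].sum(axis=(-2, -1))
    per_head = to_sites / attn.sum(axis=(-2, -1))    # (layers, heads)
    return per_head.mean(axis=-1)

# Toy usage: compare against the background frequency of binding sites (~4.8% in the paper).
L = 80
attn = np.random.rand(12, 12, L, L)
attn /= attn.sum(axis=-1, keepdims=True)
sites = np.random.rand(L) < 0.05
print(binding_site_attention_by_layer(attn, sites).round(3))
print("background frequency:", sites.mean())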
Attention targets binding sites
“Figure 7 shows the estimated
probability of head 7-1 targeting a
binding site, as a function of the
attention weight.
We also find that tokens often target
binding sites from far away in the
sequence. In Head 7-1, for example,
the average distance spanned by
attention to binding sites is 124
tokens.”
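
One simple way to estimate a curve like Figure 7 is to bin attention weights and take the empirical frequency of binding-site targets within each bin. The binning estimator and all names below are an assumption for illustration, not necessarily the estimator the authors used.

import numpy as np

def binding_site_prob_by_weight(attn_head, is_binding_site, bins=10):
    """Empirical P(target residue is a binding site | attention weight), by binning.

    attn_head:       (seq_len, seq_len) attention of a single head (e.g. head 7-1)
    is_binding_site: (seq_len,) boolean annotation of the attended-to (key) residues
    Returns (bin_edges, per-bin probability).
    """
    weights = attn_head.ravel()
    is_site = np.broadcast_to(is_binding_site, attn_head.shape).ravel()
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(weights, edges) - 1, 0, bins - 1)
    prob = np.array([is_site[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(bins)])
    return edges, prob

# Toy usage with stand-in attention and annotations.
L = 60
attn = np.random.rand(L, L)
attn /= attn.sum(axis=-1, keepdims=True)
sites = np.random.rand(L) < 0.05
print(binding_site_prob_by_weight(attn, sites, bins=5)[1])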
Attention targets higher-level props in deeper layers
“As shown in Figure 8, deeper layers
focus relatively more attention on
binding sites and contacts (high-level
concept), whereas secondary structure
(low- to mid-level concept) is targeted
more evenly across layers.”
“The probing analysis (Figure 9) similarly shows
that the model first forms representations of
secondary structure before fully encoding
contact maps and binding sites. ”
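
A minimal sketch of a layerwise probe in the spirit of Figure 9: fit a simple classifier on frozen per-token representations from each layer and see where a concept becomes linearly decodable. The logistic-regression probe, stand-in hidden states, and labels are assumptions for illustration; the paper's probing setup differs in detail.

import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_layer(hidden_states, labels):
    """Fit a linear probe on frozen per-token embeddings from one layer.

    hidden_states: (n_tokens, hidden_dim) representations from a single layer
    labels:        (n_tokens,) token-level annotations (e.g. binding site yes/no)
    Returns the probe's training accuracy (a proper setup would use a held-out split).
    """
    clf = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
    return clf.score(hidden_states, labels)

# Toy usage: probe every layer of a 12-layer, 512-dim model with stand-in data.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(500, 512)) for _ in range(12)]
labels = rng.integers(0, 2, 500)
print([round(probe_layer(h, labels), 2) for h in layers])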
Discussions
https://ru.linkedin.com/in/grigorysapunov
grigory.sapunov@ieee.org
Thanks!
