Transformers in 2021
Grigory Sapunov
DataFest Yerevan 2021
10.09.2021
gs@inten.to
Who am I?
● MD in CS (2002), PhD in AI (2006)
● ex-Yandex News Dev. Team Leader (2007-2012)
● CTO & co-founder of Intento (2016+) and
Berkeley SkyDeck alumni (Spring 2019)
● Member of Scientific Advisory Board
at Atlas Biomed
● Google Developer Expert in
Machine Learning
Prerequisites
● Transformer architecture understanding
○ Original paper: https://arxiv.org/abs/1706.03762
○ Great visual explanation: http://jalammar.github.io/illustrated-transformer
○ Lecture #12 from my DL course
https://github.com/che-shr-cat/deep-learning-for-biology-hse-2019-course
● This talk is, in some sense, a follow-up to these two:
○ https://www.youtube.com/watch?v=KZ9NXYcXVBY (GDG DevParty)
○ https://www.youtube.com/watch?v=7e4LxIVENZA (GDG DevFest)
● Sidenote: many modern transformers are described and discussed in
our Telegram channel & chat on ML research papers:
https://t.me/gonzo_ML
Recap: Transformer Architecture
Transformer
A new simple network architecture,
the Transformer:
● Is an Encoder-Decoder architecture
● Based solely on attention mechanisms
(no RNN/CNN)
● The major component of the transformer
is the multi-head self-attention
mechanism.
● Fast: only matrix multiplications
● Strong results on standard WMT
datasets
Multi-head self-attention mechanism
Essentially, the Multi-Head Attention is just
several attention layers stacked together
with different linear transformations of the
same input.
Scaled dot-product attention
The input consists of queries and keys of
dimension d_k, and values of dimension d_v.
The transformer adopts scaled dot-product
attention: the output is a weighted sum of
the values, where the weight assigned to
each value is determined by the dot product
of the query with all the keys:
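The formula in question, written out here since the slide image is not reproduced in the text (this is the standard formulation from the original paper):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

Multi-Head Attention concatenates h such heads, each with its own learned projections:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)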
Quadratic attention
Efficient Transformers: A Survey
https://arxiv.org/abs/2009.06732
Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training
schedule (warm-ups, cyclic
learning rates, etc)
● O(N²) computational complexity
of the attention mechanism
→ scales poorly (see the
back-of-the-envelope sketch after this list)
● limited context span (mostly
due to the complexity),
typically 512 tokens
→ can’t process long sequences.
● May need different implicit bias for other types of data (e.g. image,
sound, etc)
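To make the O(N²) bullet concrete, a tiny back-of-the-envelope calculation (my numbers, not from the slides): the attention score matrix alone holds N × N entries per head and per layer, so doubling the context quadruples its memory.

# Memory taken by a single N x N float32 attention score matrix
# (one head, one layer), just to show how quickly it grows with context length.
for n in (512, 2048, 8192, 32768):
    mib = n * n * 4 / 2**20
    print(f"context {n:>6}: {mib:>8.1f} MiB per score matrix")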
Year 2021 directions
Directions in 2021
● (Still) Large transformers
● (Still) Efficient transformers
● New modalities:
○ more image transformers
○ audio transformers
○ transformers in biology and other domains (graphs)
● Multimodality: CLIP, DALL·E, Perceiver + IO, …
● Artistic applications: CLIPDraw etc
1. Large Transformers
Large models
http://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
Large models in 2021
● (English) GPT-Neo (2.7B), GPT-J (6B),
Jurassic-1 (7.5B/178B)
● (Russian) ruGPT-3 (13B)
● (Chinese) CPM-2 (11B/198B* - MoE),
M6 (10B/100B), Wu Dao 2.0 (1.75T*),
PanGu-α (2.6B/13B/207B)
● (Korean) HyperCLOVA (204B)
● (Code) OpenAI Codex (12B),
Google’s (up to 137B)
● ByT5 (up to 12.9B)
● XLM-R XL/XXL (3.5B/10.7B)
● DeBERTa (1.5B)
● Switch Transformer (1.6T*)
● ERNIE 3.0 (10B)
● DALL·E (12B)
● Vision MoE (14.7B*)
Scaling laws
“Scaling Laws for Neural Language Models”
https://arxiv.org/abs/2001.08361
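The paper's headline result, restated here since the figures are not reproduced in the text: test loss falls as a power law in model size N, dataset size D, and training compute C, as long as the other two factors are not bottlenecks (exponents are the approximate values reported by the authors):

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
L(C_{\min}) \approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}

with \alpha_N \approx 0.076, \alpha_D \approx 0.095, \alpha_C \approx 0.050.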
SuperGLUE
https://super.gluebenchmark.com/leaderboard
1*. Problems of Large Models
Costs
Large model training costs
“The Cost of Training NLP Models: A Concise Overview”
https://arxiv.org/abs/2004.08900
CO₂ emissions
“Energy and Policy Considerations for Deep Learning in NLP”
https://arxiv.org/abs/1906.02243
Training Data Extraction
“Extracting Training Data from Large Language Models”
https://arxiv.org/abs/2012.07805
https://dl.acm.org/doi/10.1145/3442188.3445922
● Size Doesn’t Guarantee Diversity
○ Internet data overrepresents younger users and those from developed countries.
○ Training data is sourced by scraping only specific sites (e.g. Reddit).
○ There are structural factors including moderation practices.
○ The current practice of filtering datasets can further attenuate specific voices.
● Static Data/Changing Social Views
○ The risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive
understandings.
○ Movements with no significant media attention will not be captured at all.
○ Given the compute costs it likely isn’t feasible to fully retrain LMs frequently enough.
● Encoding Bias
○ Large LMs exhibit various kinds of bias, including stereotypical associations or
negative sentiment towards specific groups.
○ Issues with training data: unreliable news sites, banned subreddits, etc.
○ Model auditing using automated systems that are not reliable themselves.
● Documentation debt
○ Datasets are both undocumented and too large to document post hoc.
“An LM is a system for haphazardly stitching together
sequences of linguistic forms it has observed in its vast
training data, according to probabilistic information
about how they combine, but without any reference to
meaning: a stochastic parrot.”
https://dl.acm.org/doi/10.1145/3442188.3445922
https://crfm.stanford.edu/
In recent years, a new successful paradigm for building AI systems has
emerged: Train one model on a huge amount of data and adapt it to
many applications. We call such a model a foundation model.
Foundation models (e.g., GPT-3) have demonstrated impressive behavior,
but can fail unexpectedly, harbor biases, and are poorly understood.
Nonetheless, they are being deployed at scale.
The Center for Research on Foundation Models (CRFM) is an
interdisciplinary initiative born out of the Stanford Institute for
Human-Centered Artificial Intelligence (HAI) that aims to make
fundamental advances in the study, development, and deployment of
foundation models.
https://arxiv.org/abs/2108.07258
2. Efficient Transformers
“Efficient Transformers: A Survey”
https://arxiv.org/abs/2009.06732
Some recent architectural innovations
Switch Transformers:
Mixture of Experts (MoE)
architecture with only a single
expert per feed-forward layer.
Scales well with more experts.
Adds a new dimension of
scaling: ‘expert-parallelism’ in
addition to data- and
model-parallelism.
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”
https://arxiv.org/abs/2101.03961
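A toy sketch of the top-1 ("switch") routing described above. Shapes and initializations are made up, and the capacity limits and load-balancing loss from the paper are omitted for brevity; this is an illustration, not the paper's implementation.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d, E, d_ff = 8, 16, 4, 64                        # tokens, model dim, experts, FFN dim
tokens = np.random.randn(T, d)
W_router = np.random.randn(d, E) * 0.02
experts = [(np.random.randn(d, d_ff) * 0.02, np.random.randn(d_ff, d) * 0.02)
           for _ in range(E)]

probs = softmax(tokens @ W_router)                  # router probabilities, shape (T, E)
choice = probs.argmax(axis=-1)                      # top-1: a single expert per token

out = np.zeros_like(tokens)
for e, (W1, W2) in enumerate(experts):
    sel = choice == e                               # tokens routed to expert e
    h = np.maximum(tokens[sel] @ W1, 0.0) @ W2      # this expert's ReLU MLP
    out[sel] = probs[sel, e:e + 1] * h              # gate by router probability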
Some recent architectural innovations
Balanced assignment of experts
(BASE) layer:
A new kind of sparse expert model (similar
to MoE transformer or Switch transformer)
that algorithmically balances the
token-to-expert assignments (without any
new hyperparameters or auxiliary losses).
Distributes well across many GPUs (say,
128).
“BASE Layers: Simplifying Training of Large, Sparse Models”
https://arxiv.org/abs/2103.16716
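A toy sketch of the balanced-assignment idea: every expert receives exactly T / E tokens, chosen to maximize total token-expert affinity. The paper solves this linear assignment problem with a more scalable distributed algorithm; scipy's Hungarian solver stands in here purely for illustration, with made-up shapes.

import numpy as np
from scipy.optimize import linear_sum_assignment

T, d, E = 16, 32, 4                       # tokens, model dim, experts (T % E == 0)
cap = T // E                              # tokens per expert
tokens = np.random.randn(T, d)
W_router = np.random.randn(d, E) * 0.02
scores = tokens @ W_router                # token-expert affinities, shape (T, E)

# Replicate each expert column `cap` times so the balanced assignment becomes
# a one-to-one matching, then maximize total affinity.
cost = -np.repeat(scores, cap, axis=1)    # shape (T, T)
_, cols = linear_sum_assignment(cost)
expert_of_token = cols // cap             # every expert gets exactly `cap` tokens
assert all((expert_of_token == e).sum() == cap for e in range(E))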
Some recent architectural innovations
A simple yet highly accurate
approximation for vanilla attention
(see the toy sketch after this list):
● its memory usage is linear in the
input size, similar to linear attention
variants, such as Performer and RFA
● it is a drop-in replacement for vanilla
attention that does not require any
corrective pre-training
● it can also lead to significant memory
savings in the feed-forward layers after
casting them into the familiar
query-key-value framework.
“Memory-efficient Transformers via Top-k Attention”
https://arxiv.org/abs/2106.06899
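A toy, single-head sketch of the top-k idea: keep only the k largest attention scores per query, so the softmax and the weighted sum touch k values instead of all N. This naive version still materializes the full score matrix; the paper additionally computes it in chunks to keep the memory footprint linear. Shapes are assumptions.

import numpy as np

def topk_attention(Q, K, V, k):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                         # (N_q, N_kv)
    idx = np.argpartition(-scores, k - 1, axis=-1)[:, :k]   # top-k keys per query
    top = np.take_along_axis(scores, idx, axis=-1)
    w = np.exp(top - top.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                      # softmax over k scores only
    return np.einsum('qk,qkd->qd', w, V[idx])               # weighted sum of k values

Q, K, V = (np.random.randn(128, 64) for _ in range(3))
out = topk_attention(Q, K, V, k=16)                         # shape (128, 64)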
Some recent architectural innovations
Expire-Span Transformer:
● learns to retain the most important
information and expire the irrelevant
information
● scales to attend over tens of
thousands of previous timesteps
efficiently, as not all states from
previous timesteps are preserved
“Not All Memories are Created Equal: Learning to Forget by Expiring”
https://arxiv.org/abs/2105.06548
3. New Modalities
Image Transformers
There were many transformers for images already:
● Image Transformer (https://arxiv.org/abs/1802.05751)
● Sparse Transformer
(https://arxiv.org/abs/1904.10509)
● Image GPT (iGPT): just a GPT-2 trained on images
unrolled into long sequences of pixels
(https://openai.com/blog/image-gpt/)
● Axial Transformer: for images and other data
organized as high dim tensors
(https://arxiv.org/abs/1912.12180).
Image Transformers
Many more emerged in 2020-2021:
● Vision Transformer (ViT)
● Data-efficient image
Transformer (DeiT)
● Bottleneck Transformers (BoTNet)
● Vision MoE (V-MoE)
● Image Processing Transformer (IPT)
● Detection Transformer (DETR)
● TransGAN
● ...
“Transformers in Vision: A Survey”
https://arxiv.org/abs/2101.01169
Some New Transformers for Images
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
Vision Transformer (ViT)
● The image is split into patches (e.g. 16x16), flattened into a 1D sequence, then fed
into a transformer encoder (similar to BERT); see the sketch below.
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
https://arxiv.org/abs/2010.11929
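A toy sketch of the patch-embedding step described above: cut the image into non-overlapping 16x16 patches, flatten each one, project it linearly, prepend a [class] token, and add position embeddings. Weights here are random placeholders rather than learned parameters; shapes follow ViT-Base.

import numpy as np

H = W = 224; P = 16; C = 3; D = 768       # image size, patch size, channels, embed dim
image = np.random.rand(H, W, C)

patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)              # (196, 768): one row per patch

W_embed = np.random.randn(P * P * C, D) * 0.02        # learned linear projection
cls_token = np.zeros((1, D))                          # learned [class] token
pos_embed = np.random.randn((H // P) * (W // P) + 1, D) * 0.02

tokens = np.concatenate([cls_token, patches @ W_embed]) + pos_embed   # (197, 768)
# `tokens` is what the standard transformer encoder consumes, as in BERT.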
Data-efficient image Transformer (DeiT)
The architecture is identical to ViT; the
only differences are the training strategy
and the distillation token.
“Training data-efficient image transformers & distillation through attention”
https://arxiv.org/abs/2012.12877
Bottleneck Transformers (BoTNet)
● A hybrid model: ResNet +
Transformer
● Replaces the 3x3 convolutions inside
the last three ResNet bottleneck blocks
with multi-head self-attention.
● The resulting architecture, BoTNet,
scales pretty well.
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
Vision MoE (V-MoE)
● A sparse variant of the recent Vision Transformer (ViT) architecture for image
classification.
● The V-MoE replaces a subset of the dense feedforward layers in ViT with
sparse MoE layers, where each image patch is “routed” to a subset of
“experts” (MLPs).
● Scales to model sizes of 15B parameters, the largest vision models to date.
“Scaling Vision with Sparse Mixture of Experts”
https://arxiv.org/abs/2106.05974
Speech and Sound Transformers
There were many transformers for sound as well:
● Speech-Transformer (https://ieeexplore.ieee.org/document/8462506)
● Conformer (https://arxiv.org/abs/2005.08100)
● Transformer-Transducer (https://arxiv.org/abs/1910.12977)
● Transformer-Transducer (https://arxiv.org/abs/2002.02562)
● Conv-Transformer Transducer (https://arxiv.org/abs/2008.05750)
● Speech-XLNet (https://arxiv.org/abs/1910.10387)
● Audio ALBERT (https://arxiv.org/abs/2005.08575)
● Emformer (https://arxiv.org/abs/2010.10759)
● wav2vec 2.0 (https://arxiv.org/abs/2006.11477)
● ...
AST: Audio Spectrogram Transformer
“AST: Audio Spectrogram Transformer”
https://arxiv.org/abs/2104.01778
A convolution-free, purely attention-based
model for audio classification.
Very close to ViT, but AST can process
variable-length audio inputs.
ACT: Audio Captioning Transformer
“Audio Captioning Transformer”
https://arxiv.org/abs/2107.09817
Another convolution-free Transformer
based on an encoder-decoder
architecture.
Multi-channel Transformer for ASR
“End-to-End Multi-Channel Transformer for Speech Recognition”
https://arxiv.org/abs/2102.03951
Transformers in Biology
Finally, transformers have come to biology!
● ESM-1b protein language model
(https://www.pnas.org/content/118/15/e2016239118)
● MSA Transformer for multiple sequence alignment
(https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1)
● RoseTTAFold for predicting protein structures (includes graph
transformers)
(https://www.science.org/doi/abs/10.1126/science.abj8754)
● AlphaFold2 for predicting protein structures
(https://www.nature.com/articles/s41586-021-03819-2)
ESM-1b
“Biological structure and function emerge from scaling unsupervised learning to 250 million protein
sequences”, https://www.pnas.org/content/118/15/e2016239118
RoseTTAFold
“Accurate prediction of protein structures and interactions using a 3-track network”
https://www.science.org/doi/abs/10.1126/science.abj8754
AlphaFold 2
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
AlphaFold 2: Evoformer block
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
4. Multi-Modal Transformers
https://arxiv.org/abs/2101.01169
DALL·E (OpenAI)
“Zero-Shot Text-to-Image Generation”
https://arxiv.org/abs/2102.12092
A model trained on images+text
descriptions.
Autoregressively generates image tokens
based on previous text and (optionally)
image tokens.
Technically a transformer decoder.
Image tokens are obtained with a
pretrained dVAE.
Candidates are ranked using CLIP.
CLIP (OpenAI)
“Learning Transferable Visual Models From Natural Language Supervision”
https://arxiv.org/abs/2103.00020
Uses contrastive pre-training to predict which caption goes with which image.
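A minimal sketch of the symmetric contrastive objective, following the pseudocode in the paper. Both encoders are stubbed out with random features here; in the real model the temperature is a learned parameter and the embeddings come from the image and text towers.

import torch
import torch.nn.functional as F

n, d = 32, 512                                        # batch size, embedding dim
image_emb = F.normalize(torch.randn(n, d), dim=-1)    # stands in for image_encoder(images)
text_emb = F.normalize(torch.randn(n, d), dim=-1)     # stands in for text_encoder(captions)
temperature = 0.07

logits = image_emb @ text_emb.T / temperature         # (n, n) pairwise similarities
labels = torch.arange(n)                              # the i-th image matches the i-th caption
loss = (F.cross_entropy(logits, labels) +             # image -> text direction
        F.cross_entropy(logits.T, labels)) / 2        # text -> image direction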
ALIGN (Google)
https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
“Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”
https://arxiv.org/abs/2102.05918
Train EfficientNet-L2 (image encoder) and BERT-large (text encoder) with a
contrastive loss on a huge noisy dataset (1.8B image-text pairs).
CLIPDraw
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”
https://arxiv.org/abs/2106.14843
You can optimize the image to better match a text description (remember
DeepDream?).
CLIPDraw
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”
https://arxiv.org/abs/2106.14843
The image is rendered from a set of Bézier curves.
https://twitter.com/RiversHaveWings/status/1410020043178446848
“a beautiful epic wondrous fantasy painting of the ocean”
CLIP + PixelDraw
https://www.reddit.com/r/MediaSynthesis/comments/pf7ru8/set_of_asianthemed_graphics_generated_with_clipit/
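A minimal sketch of the general CLIP-guided optimization loop behind CLIPDraw and the examples above. This is not CLIPDraw itself (which optimizes Bézier-curve parameters through a differentiable vector renderer); here raw pixels are optimized directly, DeepDream-style, against a frozen model from OpenAI's clip package.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()                                  # keep everything in fp32

text = clip.tokenize(["a beautiful epic wondrous fantasy painting of the ocean"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)  # the "drawing"
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)
opt = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    img_emb = model.encode_image((image.clamp(0, 1) - mean) / std)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = -(img_emb * text_emb).sum()                 # maximize cosine similarity
    opt.zero_grad(); loss.backward(); opt.step()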
Perceiver (Google)
“Perceiver: General Perception with Iterative Attention”
https://arxiv.org/abs/2103.03206
Perceiver IO (Google)
“Perceiver IO: A General Architecture for Structured Inputs & Outputs”
https://arxiv.org/abs/2107.14795
https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!
