2. Who am I?
● MS in CS (2002), PhD in AI (2006)
● ex-Yandex News Dev. Team Leader (2007-2012)
● CTO & co-founder of Intento (2016+) and
Berkeley SkyDeck alumnus (Spring 2019)
● Member of Scientific Advisory Board
at Atlas Biomed
● Google Developer Expert in
Machine Learning
3. Prerequisites
● Transformer architecture understanding
○ Original paper: https://arxiv.org/abs/1706.03762
○ Great visual explanation: http://jalammar.github.io/illustrated-transformer
○ Lecture #12 from my DL course:
https://github.com/che-shr-cat/deep-learning-for-biology-hse-2019-course
● This talk is, in some sense, a follow-up to these two:
○ https://www.youtube.com/watch?v=KZ9NXYcXVBY (GDG DevParty)
○ https://www.youtube.com/watch?v=7e4LxIVENZA (GDG DevFest)
● Side note: many modern transformers are described and discussed in
our Telegram channel & chat on ML research papers:
https://t.me/gonzo_ML
5. Transformer
A new simple network architecture,
the Transformer:
● Is an encoder-decoder architecture
● Based solely on attention mechanisms
(no RNNs/CNNs)
● Its major component is the
multi-head self-attention unit.
● Fast: only matrix multiplications
● Strong results on standard WMT
datasets
8. Scaled dot-product attention
The transformer adopts scaled
dot-product attention: the output is a
weighted sum of the values, where the
weight assigned to each value is
determined by the dot-product of the
query with all the keys.
The input consists of queries and keys of
dimension d_k, and values of dimension d_v.
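The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration (no batching, no masking, toy dimensions chosen for the example), not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) dot-products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of values

# Toy example: 3 queries, 4 keys/values, d_k = 8, d_v = 5
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 5))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 5): one d_v-dimensional output per query
```

A sanity check on the intuition: if all scores are equal (e.g. Q and K are zero), the softmax is uniform and each output is simply the mean of the values.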
10. Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training
schedule (warm-up, cyclic
learning rates, etc.)
● The attention mechanism has
O(N²) computational
complexity → scales poorly
● Limited context span (mostly
due to that complexity),
typically 512 tokens
→ can’t process long sequences.
● May need a different inductive bias for other types of data (e.g. images,
sound, etc.)
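The quadratic cost is easy to see: the attention score matrix has one entry per (query, key) pair. A back-of-the-envelope sketch (assuming fp32 scores, one head, one layer; real models multiply this by heads, layers, and batch size):

```python
def attention_matrix_bytes(n_tokens, n_heads=1, bytes_per_score=4):
    """Memory for the full N x N attention score matrix."""
    return n_heads * n_tokens * n_tokens * bytes_per_score

# Doubling the context quadruples the score matrix:
for n in (512, 1024, 4096):
    print(n, attention_matrix_bytes(n) / 2**20, "MiB")
# 512 tokens -> 1 MiB, 1024 -> 4 MiB, 4096 -> 64 MiB (per head, per layer)
```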
25. ● Size Doesn’t Guarantee Diversity
○ Internet data overrepresents younger users and those from developed countries.
○ Training data is sourced by scraping only specific sites (e.g. Reddit).
○ There are structural factors including moderation practices.
○ The current practice of filtering datasets can further attenuate specific voices.
● Static Data/Changing Social Views
○ The risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive
understandings.
○ Movements with no significant media attention will not be captured at all.
○ Given the compute costs it likely isn’t feasible to fully retrain LMs frequently enough.
● Encoding Bias
○ Large LMs exhibit various kinds of bias, including stereotypical associations or
negative sentiment towards specific groups.
○ Issues with training data: unreliable news sites, banned subreddits, etc.
○ Model auditing using automated systems that are not reliable themselves.
● Documentation debt
○ Datasets are both undocumented and too large to document post hoc.
26. “An LM is a system for haphazardly stitching together
sequences of linguistic forms it has observed in its vast
training data, according to probabilistic information
about how they combine, but without any reference to
meaning: a stochastic parrot.”
https://dl.acm.org/doi/10.1145/3442188.3445922
27. https://crfm.stanford.edu/
In recent years, a new successful paradigm for building AI systems has
emerged: Train one model on a huge amount of data and adapt it to
many applications. We call such a model a foundation model.
Foundation models (e.g., GPT-3) have demonstrated impressive behavior,
but can fail unexpectedly, harbor biases, and are poorly understood.
Nonetheless, they are being deployed at scale.
The Center for Research on Foundation Models (CRFM) is an
interdisciplinary initiative born out of the Stanford Institute for
Human-Centered Artificial Intelligence (HAI) that aims to make
fundamental advances in the study, development, and deployment of
foundation models.
34. Some recent architectural innovations
Switch Transformers:
Mixture of Experts (MoE)
architecture with only a single
expert per feed-forward layer.
Scales well with more experts.
Adds a new dimension of
scaling: ‘expert-parallelism’ in
addition to data- and
model-parallelism.
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”
https://arxiv.org/abs/2101.03961
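The core of the Switch layer, top-1 routing, can be sketched as follows. This is an illustrative NumPy toy (the function names are mine, and each "expert" is a single matrix rather than the paper's two-layer FFN; load-balancing losses and capacity limits are omitted):

```python
import numpy as np

def switch_route(tokens, W_router, expert_weights):
    """Top-1 (Switch-style) routing: each token is sent to exactly one expert.

    tokens:         (n, d) token representations
    W_router:       (d, n_experts) router weights
    expert_weights: list of (d, d) matrices, one linear "expert" each
    """
    logits = tokens @ W_router                      # (n, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # router softmax
    choice = probs.argmax(axis=-1)                  # top-1 expert per token
    out = np.empty_like(tokens)
    for i, e in enumerate(choice):
        # Scale by the gate probability so the router receives a gradient signal
        out[i] = probs[i, e] * (expert_weights[e] @ tokens[i])
    return out, choice

rng = np.random.default_rng(1)
d, n_experts = 4, 3
out, choice = switch_route(
    rng.normal(size=(8, d)),                   # 8 tokens
    rng.normal(size=(d, n_experts)),           # router
    [rng.normal(size=(d, d)) for _ in range(n_experts)],
)
```

Since each token activates only one expert, the per-token compute stays roughly constant as the number of experts (and hence parameters) grows.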
35. Some recent architectural innovations
Balanced assignment of experts
(BASE) layer:
A new kind of sparse expert model (similar
to the MoE transformer or Switch transformer)
that algorithmically balances the
token-to-expert assignments (without any
new hyperparameters or auxiliary losses).
Distributes well across many GPUs (say,
128).
“BASE Layers: Simplifying Training of Large, Sparse Models”
https://arxiv.org/abs/2103.16716
36. Some recent architectural innovations
A simple yet highly accurate
approximation of vanilla attention:
● its memory usage is linear in the
input size, similar to linear attention
variants, such as Performer and RFA
● it is a drop-in replacement for vanilla
attention that does not require any
corrective pre-training
● it can also lead to significant memory
savings in the feed-forward layers after
casting them into the familiar
query-key-value framework.
“Memory-efficient Transformers via Top-k Attention”
https://arxiv.org/abs/2106.06899
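The core idea, keeping only each query's k largest scores before the softmax, can be sketched as below. This is a naive NumPy illustration of the math; the paper's memory savings come from a chunked implementation that never materializes the full score matrix, which this toy does not reproduce:

```python
import numpy as np

def top_k_attention(Q, K, V, k):
    """Vanilla attention, but each query keeps only its k largest scores."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    kth = np.sort(scores, axis=-1)[:, -k][:, None]    # k-th largest per row
    masked = np.where(scores >= kth, scores, -np.inf)  # drop everything below it
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over surviving keys
    return w @ V

rng = np.random.default_rng(2)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 5))
out = top_k_attention(Q, K, V, k=4)   # each query attends to only 4 of 16 keys
```

With k equal to the number of keys, this reduces exactly to vanilla attention, which is why it works as a drop-in replacement.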
37. Some recent architectural innovations
Expire-Span Transformer:
● learns to retain the most important
information and expire the irrelevant
information
● scales to attend over tens of
thousands of previous timesteps
efficiently, as not all states from
previous timesteps are preserved
“Not All Memories are Created Equal: Learning to Forget by Expiring”
https://arxiv.org/abs/2105.06548
39. Image Transformers
There were many transformers for images already:
● Image Transformer (https://arxiv.org/abs/1802.05751)
● Sparse Transformer
(https://arxiv.org/abs/1904.10509)
● Image GPT (iGPT): just a GPT-2 trained on images
unrolled into long sequences of pixels
(https://openai.com/blog/image-gpt/)
● Axial Transformer: for images and other data
organized as high dim tensors
(https://arxiv.org/abs/1912.12180).
42. Some New Transformers for Images
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
43. Vision Transformer (ViT)
● The image is split into patches (e.g. 16x16), flattened into a 1D sequence, then fed
into a transformer encoder (similar to BERT).
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
https://arxiv.org/abs/2010.11929
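The patching step can be sketched in NumPy. A minimal version (no learned linear projection, class token, or position embeddings, which ViT adds on top):

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an (H, W, C) image into flattened patch vectors, ViT-style.

    Returns (num_patches, patch*patch*C): a 1D sequence of "visual words".
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
seq = image_to_patches(img)   # 14 x 14 = 196 patches
print(seq.shape)              # (196, 768): 196 "words" of dimension 16*16*3
```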
44. Data-efficient image Transformer (DeiT)
The architecture is identical to ViT; the
only differences are the training strategies
and the distillation token.
“Training data-efficient image transformers & distillation through attention”
https://arxiv.org/abs/2012.12877
45. Bottleneck Transformers (BoTNet)
● A hybrid model: ResNet +
Transformer
● Replaces the internal 3x3 convolutions
in ResNet blocks (only the last
three) with multi-head self-attention.
● The resulting architecture, BoTNet,
scales pretty well.
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
46. Vision MoE (V-MoE)
● A sparse variant of the recent Vision Transformer (ViT) architecture for image
classification.
● The V-MoE replaces a subset of the dense feed-forward layers in ViT with
sparse MoE layers, where each image patch is “routed” to a subset of
“experts” (MLPs).
● Scales to model sizes of 15B parameters, the largest vision models to date.
“Scaling Vision with Sparse Mixture of Experts”
https://arxiv.org/abs/2106.05974
47. Speech and Sound Transformers
There were many transformers for sound as well:
● Speech-Transformer (https://ieeexplore.ieee.org/document/8462506)
● Conformer (https://arxiv.org/abs/2005.08100)
● Transformer-Transducer (https://arxiv.org/abs/1910.12977)
● Transformer-Transducer (https://arxiv.org/abs/2002.02562)
● Conv-Transformer Transducer (https://arxiv.org/abs/2008.05750)
● Speech-XLNet (https://arxiv.org/abs/1910.10387)
● Audio ALBERT (https://arxiv.org/abs/2005.08575)
● Emformer (https://arxiv.org/abs/2010.10759)
● wav2vec 2.0 (https://arxiv.org/abs/2006.11477)
● ...
48. AST: Audio Spectrogram Transformer
“AST: Audio Spectrogram Transformer”
https://arxiv.org/abs/2104.01778
A convolution-free, purely attention-based
model for audio classification.
Very close to ViT, but AST can process
variable-length audio inputs.
49. ACT: Audio Captioning Transformer
“Audio Captioning Transformer”
https://arxiv.org/abs/2107.09817
Another convolution-free Transformer
based on an encoder-decoder
architecture.
50. Multi-channel Transformer for ASR
“End-to-End Multi-Channel Transformer for Speech Recognition”
https://arxiv.org/abs/2102.03951
51. Transformers in Biology
Transformers have finally arrived in biology!
● ESM-1b protein language model
(https://www.pnas.org/content/118/15/e2016239118)
● MSA Transformer for multiple sequence alignment
(https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1)
● RoseTTAFold for predicting protein structures (includes graph
transformers)
(https://www.science.org/doi/abs/10.1126/science.abj8754)
● AlphaFold2 for predicting protein structures
(https://www.nature.com/articles/s41586-021-03819-2)
52. ESM-1b
“Biological structure and function emerge from scaling unsupervised learning to 250 million protein
sequences”, https://www.pnas.org/content/118/15/e2016239118
53. RoseTTAFold
“Accurate prediction of protein structures and interactions using a 3-track network”
https://www.science.org/doi/abs/10.1126/science.abj8754
54. AlphaFold 2
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
55. AlphaFold 2: Evoformer block
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
58. DALL·E (OpenAI)
“Zero-Shot Text-to-Image Generation”
https://arxiv.org/abs/2102.12092
A model trained on images paired with
text descriptions.
Autoregressively generates image tokens
based on previous text and (optionally)
image tokens.
Technically a transformer decoder.
Image tokens are obtained with a
pretrained dVAE.
Candidates are ranked using CLIP.
59. CLIP (OpenAI)
“Learning Transferable Visual Models From Natural Language Supervision”
https://arxiv.org/abs/2103.00020
Uses contrastive pre-training to predict which caption goes with which image.
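The contrastive objective can be sketched as a symmetric cross-entropy over a batch similarity matrix: matching (image, caption) pairs sit on the diagonal, and the loss pulls them together while pushing mismatched pairs apart. A NumPy toy with made-up embeddings (function names are mine; CLIP additionally learns the temperature and uses projection heads):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (image, caption) pairs."""
    # L2-normalize, then temperature-scaled cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature             # (n, n); diagonal = true pairs

    def xent_diag(l):  # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=-1, keepdims=True)
        logprob = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(logprob))

    # Symmetric: classify images over captions AND captions over images
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(3)
emb = rng.normal(size=(8, 32))
aligned = clip_contrastive_loss(emb, emb)                      # perfect pairs
random_ = clip_contrastive_loss(emb, rng.normal(size=(8, 32))) # shuffled pairs
```

Perfectly aligned pairs give a near-zero loss, while unrelated captions drive it toward log(batch size), which is what makes the learned embeddings transferable to zero-shot classification.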
61. CLIPDraw
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”
https://arxiv.org/abs/2106.14843
You can optimize the image to better match a text description (remember
DeepDream?).