2. Who am I?
● MS in CS (2002), PhD in AI (2006)
● ex-Yandex News Dev. Team Leader (2007-2012)
● CTO & co-founder of Intento (2016+) and
Berkeley SkyDeck alumnus (Spring 2019)
● Member of Scientific Advisory Board
at Atlas Biomed
● Google Developer Expert in
Machine Learning
3. Prerequisites
● Transformer architecture understanding
○ Original paper: https://arxiv.org/abs/1706.03762
○ Great visual explanation: http://jalammar.github.io/illustrated-transformer
○ Lecture #12 from my DL course:
https://github.com/che-shr-cat/deep-learning-for-biology-hse-2019-course
● This talk is, in some sense, a follow-up to these two:
○ https://www.youtube.com/watch?v=KZ9NXYcXVBY (GDG DevParty)
○ https://www.youtube.com/watch?v=7e4LxIVENZA (GDG DevFest)
● Side note: many modern transformers are described and discussed in
our Telegram channel & chat on ML research papers:
https://t.me/gonzo_ML
5. Transformer
A new simple network architecture,
the Transformer:
● Is an encoder-decoder architecture
● Based solely on attention mechanisms
(no RNNs/CNNs)
● Its major component is the
multi-head self-attention unit.
● Fast: only matrix multiplications
● Strong results on standard WMT
datasets
8. Scaled dot-product attention
The transformer adopts scaled
dot-product attention: the output is a
weighted sum of the values, where the
weight assigned to each value is
determined by the dot-product of the
query with all the keys.
The input consists of queries and keys of
dimension d_k, and values of dimension d_v.
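The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration (no batching, no masking, toy dimensions chosen for the example), not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) dot-products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of values

# Toy example: 3 queries, 4 keys/values, d_k = 8, d_v = 5
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 5))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 5): one d_v-dimensional output per query
```

A sanity check on the intuition: if all scores are equal (e.g. Q and K are zero), the softmax is uniform and each output is simply the mean of the values.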
10. Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training
schedule (warm-up, cyclic
learning rates, etc.)
● The attention mechanism has
O(N²) computational
complexity → scales poorly
● Limited context span (mostly
due to that complexity),
typically 512 tokens
→ can’t process long sequences.
● May need a different inductive bias for other types of data (e.g. images,
sound, etc.)
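The quadratic cost is easy to see: the attention score matrix has one entry per (query, key) pair. A back-of-the-envelope sketch (assuming fp32 scores, one head, one layer; real models multiply this by heads, layers, and batch size):

```python
def attention_matrix_bytes(n_tokens, n_heads=1, bytes_per_score=4):
    """Memory for the full N x N attention score matrix."""
    return n_heads * n_tokens * n_tokens * bytes_per_score

# Doubling the context quadruples the score matrix:
for n in (512, 1024, 4096):
    print(n, attention_matrix_bytes(n) / 2**20, "MiB")
# 512 tokens -> 1 MiB, 1024 -> 4 MiB, 4096 -> 64 MiB (per head, per layer)
```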
25. ● Size Doesn’t Guarantee Diversity
○ Internet data overrepresents younger users and those from developed countries.
○ Training data is sourced by scraping only specific sites (e.g. Reddit).
○ There are structural factors including moderation practices.
○ The current practice of filtering datasets can further attenuate specific voices.
● Static Data/Changing Social Views
○ The risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive
understandings.
○ Movements with no significant media attention will not be captured at all.
○ Given the compute costs it likely isn’t feasible to fully retrain LMs frequently enough.
● Encoding Bias
○ Large LMs exhibit various kinds of bias, including stereotypical associations or
negative sentiment towards specific groups.
○ Issues with training data: unreliable news sites, banned subreddits, etc.
○ Model auditing using automated systems that are not reliable themselves.
● Documentation debt
○ Datasets are both undocumented and too large to document post hoc.
26. “An LM is a system for haphazardly stitching together
sequences of linguistic forms it has observed in its vast
training data, according to probabilistic information
about how they combine, but without any reference to
meaning: a stochastic parrot.”
https://dl.acm.org/doi/10.1145/3442188.3445922
27. https://crfm.stanford.edu/
In recent years, a new successful paradigm for building AI systems has
emerged: Train one model on a huge amount of data and adapt it to
many applications. We call such a model a foundation model.
Foundation models (e.g., GPT-3) have demonstrated impressive behavior,
but can fail unexpectedly, harbor biases, and are poorly understood.
Nonetheless, they are being deployed at scale.
The Center for Research on Foundation Models (CRFM) is an
interdisciplinary initiative born out of the Stanford Institute for
Human-Centered Artificial Intelligence (HAI) that aims to make
fundamental advances in the study, development, and deployment of
foundation models.
34. Some recent architectural innovations
Switch Transformers:
Mixture of Experts (MoE)
architecture with only a single
expert per feed-forward layer.
Scales well with more experts.
Adds a new dimension of
scaling: ‘expert-parallelism’ in
addition to data- and
model-parallelism.
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”
https://arxiv.org/abs/2101.03961
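The core of the Switch layer, top-1 routing, can be sketched as follows. This is an illustrative NumPy toy (the function names are mine, and each "expert" is a single matrix rather than the paper's two-layer FFN; load-balancing losses and capacity limits are omitted):

```python
import numpy as np

def switch_route(tokens, W_router, expert_weights):
    """Top-1 (Switch-style) routing: each token is sent to exactly one expert.

    tokens:         (n, d) token representations
    W_router:       (d, n_experts) router weights
    expert_weights: list of (d, d) matrices, one linear "expert" each
    """
    logits = tokens @ W_router                      # (n, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)      # router softmax
    choice = probs.argmax(axis=-1)                  # top-1 expert per token
    out = np.empty_like(tokens)
    for i, e in enumerate(choice):
        # Scale by the gate probability so the router receives a gradient signal
        out[i] = probs[i, e] * (expert_weights[e] @ tokens[i])
    return out, choice

rng = np.random.default_rng(1)
d, n_experts = 4, 3
out, choice = switch_route(
    rng.normal(size=(8, d)),                   # 8 tokens
    rng.normal(size=(d, n_experts)),           # router
    [rng.normal(size=(d, d)) for _ in range(n_experts)],
)
```

Since each token activates only one expert, the per-token compute stays roughly constant as the number of experts (and hence parameters) grows.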
35. Some recent architectural innovations
Balanced assignment of experts
(BASE) layer:
A new kind of sparse expert model (similar
to the MoE transformer or Switch transformer)
that algorithmically balances the
token-to-expert assignments (without any
new hyperparameters or auxiliary losses).
Distributes well across many GPUs (say,
128).
“BASE Layers: Simplifying Training of Large, Sparse Models”
https://arxiv.org/abs/2103.16716
36. Some recent architectural innovations
A simple yet highly accurate
approximation of vanilla attention:
● its memory usage is linear in the
input size, similar to linear attention
variants, such as Performer and RFA
● it is a drop-in replacement for vanilla
attention that does not require any
corrective pre-training
● it can also lead to significant memory
savings in the feed-forward layers after
casting them into the familiar
query-key-value framework.
“Memory-efficient Transformers via Top-k Attention”
https://arxiv.org/abs/2106.06899
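The core idea, keeping only each query's k largest scores before the softmax, can be sketched as below. This is a naive NumPy illustration of the math; the paper's memory savings come from a chunked implementation that never materializes the full score matrix, which this toy does not reproduce:

```python
import numpy as np

def top_k_attention(Q, K, V, k):
    """Vanilla attention, but each query keeps only its k largest scores."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    kth = np.sort(scores, axis=-1)[:, -k][:, None]    # k-th largest per row
    masked = np.where(scores >= kth, scores, -np.inf)  # drop everything below it
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over surviving keys
    return w @ V

rng = np.random.default_rng(2)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 5))
out = top_k_attention(Q, K, V, k=4)   # each query attends to only 4 of 16 keys
```

With k equal to the number of keys, this reduces exactly to vanilla attention, which is why it works as a drop-in replacement.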
37. Some recent architectural innovations
Expire-Span Transformer:
● learns to retain the most important
information and expire the irrelevant
information
● scales to attend over tens of
thousands of previous timesteps
efficiently, as not all states from
previous timesteps are preserved
“Not All Memories are Created Equal: Learning to Forget by Expiring”
https://arxiv.org/abs/2105.06548
39. Image Transformers
There were many transformers for images already:
● Image Transformer (https://arxiv.org/abs/1802.05751)
● Sparse Transformer
(https://arxiv.org/abs/1904.10509)
● Image GPT (iGPT): just a GPT-2 trained on images
unrolled into long sequences of pixels
(https://openai.com/blog/image-gpt/)
● Axial Transformer: for images and other data
organized as high dim tensors
(https://arxiv.org/abs/1912.12180).
42. Some New Transformers for Images
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
43. Vision Transformer (ViT)
● The image is split into patches (e.g. 16x16), flattened into a 1D sequence, then fed
into a transformer encoder (similar to BERT).
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
https://arxiv.org/abs/2010.11929
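The patching step can be sketched in NumPy. A minimal version (no learned linear projection, class token, or position embeddings, which ViT adds on top):

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an (H, W, C) image into flattened patch vectors, ViT-style.

    Returns (num_patches, patch*patch*C): a 1D sequence of "visual words".
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
seq = image_to_patches(img)   # 14 x 14 = 196 patches
print(seq.shape)              # (196, 768): 196 "words" of dimension 16*16*3
```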
44. Data-efficient image Transformer (DeiT)
The architecture is identical to ViT; the
only differences are the training strategies
and the distillation token.
“Training data-efficient image transformers & distillation through attention”
https://arxiv.org/abs/2012.12877
45. Bottleneck Transformers (BoTNet)
● A hybrid model: ResNet +
Transformer
● Replaces the internal 3x3 convolutions
in ResNet blocks (only the last
three) with multi-head self-attention.
● The resulting architecture, BoTNet,
scales pretty well.
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
46. Vision MoE (V-MoE)
● A sparse variant of the recent Vision Transformer (ViT) architecture for image
classification.
● The V-MoE replaces a subset of the dense feed-forward layers in ViT with
sparse MoE layers, where each image patch is “routed” to a subset of
“experts” (MLPs).
● Scales to model sizes of 15B parameters, the largest vision models to date.
“Scaling Vision with Sparse Mixture of Experts”
https://arxiv.org/abs/2106.05974
47. Speech and Sound Transformers
There were many transformers for sound as well:
● Speech-Transformer (https://ieeexplore.ieee.org/document/8462506)
● Conformer (https://arxiv.org/abs/2005.08100)
● Transformer-Transducer (https://arxiv.org/abs/1910.12977)
● Transformer-Transducer (https://arxiv.org/abs/2002.02562)
● Conv-Transformer Transducer (https://arxiv.org/abs/2008.05750)
● Speech-XLNet (https://arxiv.org/abs/1910.10387)
● Audio ALBERT (https://arxiv.org/abs/2005.08575)
● Emformer (https://arxiv.org/abs/2010.10759)
● wav2vec 2.0 (https://arxiv.org/abs/2006.11477)
● ...
48. AST: Audio Spectrogram Transformer
“AST: Audio Spectrogram Transformer”
https://arxiv.org/abs/2104.01778
A convolution-free, purely attention-based
model for audio classification.
Very close to ViT, but AST can process
variable-length audio inputs.
49. ACT: Audio Captioning Transformer
“Audio Captioning Transformer”
https://arxiv.org/abs/2107.09817
Another convolution-free Transformer
based on an encoder-decoder
architecture.
50. Multi-channel Transformer for ASR
“End-to-End Multi-Channel Transformer for Speech Recognition”
https://arxiv.org/abs/2102.03951
51. Transformers in Biology
Transformers have finally arrived in biology!
● ESM-1b protein language model
(https://www.pnas.org/content/118/15/e2016239118)
● MSA Transformer for multiple sequence alignment
(https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1)
● RoseTTAFold for predicting protein structures (includes graph
transformers)
(https://www.science.org/doi/abs/10.1126/science.abj8754)
● AlphaFold2 for predicting protein structures
(https://www.nature.com/articles/s41586-021-03819-2)
52. ESM-1b
“Biological structure and function emerge from scaling unsupervised learning to 250 million protein
sequences”, https://www.pnas.org/content/118/15/e2016239118
53. RoseTTAFold
“Accurate prediction of protein structures and interactions using a 3-track network”
https://www.science.org/doi/abs/10.1126/science.abj8754
54. AlphaFold 2
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
55. AlphaFold 2: Evoformer block
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
58. DALL·E (OpenAI)
“Zero-Shot Text-to-Image Generation”
https://arxiv.org/abs/2102.12092
A model trained on images paired with
text descriptions.
Autoregressively generates image tokens
based on previous text and (optionally)
image tokens.
Technically a transformer decoder.
Image tokens are obtained with a
pretrained dVAE.
Candidates are ranked using CLIP.
59. CLIP (OpenAI)
“Learning Transferable Visual Models From Natural Language Supervision”
https://arxiv.org/abs/2103.00020
Uses contrastive pre-training to predict which caption goes with which image.
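The contrastive objective can be sketched as a symmetric cross-entropy over a batch similarity matrix: matching (image, caption) pairs sit on the diagonal, and the loss pulls them together while pushing mismatched pairs apart. A NumPy toy with made-up embeddings (function names are mine; CLIP additionally learns the temperature and uses projection heads):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (image, caption) pairs."""
    # L2-normalize, then temperature-scaled cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature             # (n, n); diagonal = true pairs

    def xent_diag(l):  # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=-1, keepdims=True)
        logprob = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(logprob))

    # Symmetric: classify images over captions AND captions over images
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(3)
emb = rng.normal(size=(8, 32))
aligned = clip_contrastive_loss(emb, emb)                      # perfect pairs
random_ = clip_contrastive_loss(emb, rng.normal(size=(8, 32))) # shuffled pairs
```

Perfectly aligned pairs give a near-zero loss, while unrelated captions drive it toward log(batch size), which is what makes the learned embeddings transferable to zero-shot classification.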
61. CLIPDraw
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”
https://arxiv.org/abs/2106.14843
You can optimize the image to better match a text description (remember
DeepDream?).