Transformers in 2021
Grigory Sapunov
DataFest Yerevan 2021
10.09.2021
gs@inten.to
Who am I?
● MD in CS (2002), PhD in AI (2006)
● ex-Yandex News Dev. Team Leader (2007-2012)
● CTO & co-founder of Intento (2016+) and
Berkeley SkyDeck alumni (Spring 2019)
● Member of Scientific Advisory Board
at Atlas Biomed
● Google Developer Expert in
Machine Learning
Prerequisites
● Transformer architecture understanding
○ Original paper: https://arxiv.org/abs/1706.03762
○ Great visual explanation: http://jalammar.github.io/illustrated-transformer
○ Lecture #12 from my DL course
https://github.com/che-shr-cat/deep-learning-for-biology-hse-2019-course
● This talk is in some sense a follow-up talk for these two:
○ https://www.youtube.com/watch?v=KZ9NXYcXVBY (GDG DevParty)
○ https://www.youtube.com/watch?v=7e4LxIVENZA (GDG DevFest)
● Sidenote: many modern transformers are described and discussed in
our Telegram channel & chat on ML research papers:
https://t.me/gonzo_ML
Recap: Transformer Architecture
Transformer
A new, simple network architecture, the Transformer:
● Is an encoder-decoder architecture
● Based solely on attention mechanisms (no RNN/CNN)
● Its major component is the multi-head self-attention mechanism
● Fast: only matrix multiplications
● Strong results on standard WMT datasets
Multi-head self-attention mechanism
Essentially, Multi-Head Attention is just several attention heads applied in parallel, each with a different linear transformation of the same input.
Scaled dot-product attention
The transformer uses scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot product of the query with all the keys. The input consists of queries and keys of dimension d_k and values of dimension d_v:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
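For reference, a minimal NumPy sketch of scaled dot-product attention and the multi-head wrapper around it (not from the slides; shapes and random weights are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n_q, n_k) -- the quadratic part
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # (n_q, d_v)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (n, d_model); each W_*: (d_model, d_model)
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):               # each head: a different linear view of the same input
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage with random weights (assumption: d_model=64, 8 heads, 10 tokens).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))
W = [rng.normal(scale=0.1, size=(64, 64)) for _ in range(4)]
print(multi_head_self_attention(X, *W, n_heads=8).shape)  # (10, 64)
```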
Quadratic attention
Efficient Transformers: A Survey
https://arxiv.org/abs/2009.06732
Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training schedule (warm-ups, cyclic learning rates, etc.)
● O(N²) computational complexity of the attention mechanism
→ scales poorly
● Limited context span (mostly due to that complexity), typically 512 tokens
→ can’t process long sequences
● May need a different inductive bias for other types of data (e.g. images, sound)
Year 2021 directions
Directions in 2021
● (Still) Large transformers
● (Still) Efficient transformers
● New modalities:
○ more image transformers
○ audio transformers
○ transformers in biology and other domains (graphs)
● Multimodality: CLIP, DALL·E, Perceiver + IO, …
● Artistic applications: CLIPDraw, etc.
1. Large Transformers
Large models
http://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
Large models in 2021
● (English) GPT-Neo (2.7B), GPT-J (6B),
Jurassic-1 (7.5B/178B)
● (Russian) ruGPT-3 (13B)
● (Chinese) CPM-2 (11B/198B* - MoE),
M6 (10B/100B), Wu Dao 2.0 (1.75T*),
PanGu-α (2.6B/13B/207B)
● (Korean) HyperCLOVA (204B)
● (Code) OpenAI Codex (12B),
Google’s (up to 137B)
● ByT5 (up to 12.9B)
● XLM-R XL/XXL (3.5B/10.7B)
● DeBERTa (1.5B)
● Switch Transformer (1.6T*)
● ERNIE 3.0 (10B)
● DALL·E (12B)
● Vision MoE (14.7B*)
Scaling laws
“Scaling Laws for Neural Language Models”
https://arxiv.org/abs/2001.08361
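For reference, a sketch of the functional form of these laws (the constants and exponents below are the approximate empirical fits reported in Kaplan et al., 2020, rounded from memory):

```latex
% Test loss as a power law in non-embedding parameters N and dataset size D
% (N_c, D_c and the exponents are the paper's reported fits, values rounded):
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095
```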
SuperGLUE
https://super.gluebenchmark.com/leaderboard
1*. Problems of Large Models
Costs
Large model training costs
“The Cost of Training NLP Models: A Concise Overview”
https://arxiv.org/abs/2004.08900
CO2
emissions
“Energy and Policy Considerations for Deep Learning in NLP”
https://arxiv.org/abs/1906.02243
Training Data Extraction
“Extracting Training Data from Large Language Models”
https://arxiv.org/abs/2012.07805
https://dl.acm.org/doi/10.1145/3442188.3445922
● Size Doesn’t Guarantee Diversity
○ Internet data overrepresents younger users and those from developed countries.
○ Training data is sourced by scraping only specific sites (e.g. Reddit).
○ There are structural factors including moderation practices.
○ The current practice of filtering datasets can further attenuate specific voices.
● Static Data/Changing Social Views
○ The risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive
understandings.
○ Movements with no significant media attention will not be captured at all.
○ Given the compute costs it likely isn’t feasible to fully retrain LMs frequently enough.
● Encoding Bias
○ Large LMs exhibit various kinds of bias, including stereotypical associations or
negative sentiment towards specific groups.
○ Issues with training data: unreliable news sites, banned subreddits, etc.
○ Model auditing relies on automated systems that are themselves unreliable.
● Documentation debt
○ Datasets are both undocumented and too large to document post hoc.
“An LM is a system for haphazardly stitching together
sequences of linguistic forms it has observed in its vast
training data, according to probabilistic information
about how they combine, but without any reference to
meaning: a stochastic parrot.”
https://dl.acm.org/doi/10.1145/3442188.3445922
https://crfm.stanford.edu/
In recent years, a new successful paradigm for building AI systems has
emerged: Train one model on a huge amount of data and adapt it to
many applications. We call such a model a foundation model.
Foundation models (e.g., GPT-3) have demonstrated impressive behavior,
but can fail unexpectedly, harbor biases, and are poorly understood.
Nonetheless, they are being deployed at scale.
The Center for Research on Foundation Models (CRFM) is an
interdisciplinary initiative born out of the Stanford Institute for
Human-Centered Artificial Intelligence (HAI) that aims to make
fundamental advances in the study, development, and deployment of
foundation models.
https://arxiv.org/abs/2108.07258
2. Efficient Transformers
“Efficient Transformers: A Survey”
https://arxiv.org/abs/2009.06732
Some recent architectural innovations
Switch Transformers:
A Mixture of Experts (MoE) architecture that routes each token to a single expert per feed-forward layer.
Scales well with more experts.
Adds a new dimension of scaling: ‘expert-parallelism’ in addition to data- and model-parallelism.
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”
https://arxiv.org/abs/2101.03961
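A minimal NumPy sketch of the top-1 ("switch") routing idea (my own illustration; the real layer also uses a load-balancing auxiliary loss and expert capacity limits, omitted here):

```python
import numpy as np

def switch_ffn(X, W_router, experts):
    # X: (n_tokens, d_model); W_router: (d_model, n_experts)
    # experts: list of (W1, W2) pairs, each an independent feed-forward net
    logits = X @ W_router                       # router scores per token
    choice = logits.argmax(axis=-1)             # top-1 expert per token
    gate = np.exp(logits - logits.max(-1, keepdims=True))
    gate = gate / gate.sum(-1, keepdims=True)   # softmax gate values
    out = np.zeros_like(X)
    for e, (W1, W2) in enumerate(experts):
        idx = np.where(choice == e)[0]          # tokens routed to expert e
        if idx.size:
            h = np.maximum(X[idx] @ W1, 0.0)    # expert FFN (ReLU MLP)
            out[idx] = gate[idx, e:e + 1] * (h @ W2)
    return out

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, n_tokens = 16, 32, 4, 12
X = rng.normal(size=(n_tokens, d_model))
W_router = rng.normal(scale=0.1, size=(d_model, n_experts))
experts = [(rng.normal(scale=0.1, size=(d_model, d_ff)),
            rng.normal(scale=0.1, size=(d_ff, d_model))) for _ in range(n_experts)]
print(switch_ffn(X, W_router, experts).shape)   # (12, 16)
```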
Some recent architectural innovations
Balanced assignment of experts (BASE) layer:
A new kind of sparse expert model (similar to the MoE transformer or Switch Transformer) that algorithmically balances the token-to-expert assignments (without any new hyperparameters or auxiliary losses).
Distributes well across many GPUs (say, 128).
“BASE Layers: Simplifying Training of Large, Sparse Models”
https://arxiv.org/abs/2103.16716
Some recent architectural innovations
Top-k attention: a simple yet highly accurate approximation of vanilla attention:
● its memory usage is linear in the input size, similar to linear attention variants such as Performer and RFA
● it is a drop-in replacement for vanilla attention that does not require any corrective pre-training
● it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework
“Memory-efficient Transformers via Top-k Attention”
https://arxiv.org/abs/2106.06899
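A rough NumPy illustration of the idea (my own sketch; the paper's implementation also chunks queries to realize the memory savings, which is omitted here):

```python
import numpy as np

def topk_attention(Q, K, V, k):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k)
    # indices of the k largest scores for every query
    top_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, top_idx, 0.0, axis=-1)
    scores = scores + mask                               # non-top-k scores -> -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                             # exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16)); K = rng.normal(size=(32, 16)); V = rng.normal(size=(32, 16))
print(topk_attention(Q, K, V, k=4).shape)                # (8, 16)
```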
Some recent architectural innovations
Expire-Span Transformer:
● learns to retain the most important
information and expire the irrelevant
information
● scales to attend over tens of
thousands of previous timesteps
efficiently, as not all states from
previous timesteps are preserved
“Not All Memories are Created Equal: Learning to Forget by Expiring”
https://arxiv.org/abs/2105.06548
3. New Modalities
Image Transformers
There were many transformers for images already:
● Image Transformer (https://arxiv.org/abs/1802.05751)
● Sparse Transformer
(https://arxiv.org/abs/1904.10509)
● Image GPT (iGPT): just a GPT-2 trained on images
unrolled into long sequences of pixels
(https://openai.com/blog/image-gpt/)
● Axial Transformer: for images and other data organized as high-dimensional tensors
(https://arxiv.org/abs/1912.12180)
Image Transformers
Many more emerged in 2020-2021:
● Vision Transformer (ViT)
● Data-efficient image
Transformer (DeiT)
● Bottleneck Transformers (BoTNet)
● Vision MoE (V-MoE)
● Image Processing Transformer (IPT)
● Detection Transformer (DETR)
● TransGAN
● ...
“Transformers in Vision: A Survey”
https://arxiv.org/abs/2101.01169
Some New Transformers for Images
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
Vision Transformer (ViT)
● The image is split into patches (e.g. 16x16), which are flattened into a 1D sequence and fed into a transformer encoder (similar to BERT).
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
https://arxiv.org/abs/2010.11929
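A minimal NumPy sketch of the ViT input pipeline: split the image into patches, flatten and project each patch, prepend a [class] token and add position embeddings (all weights below are random placeholders for illustration):

```python
import numpy as np

def patchify(img, p):
    # img: (H, W, C) -> (num_patches, p*p*C)
    H, W, C = img.shape
    img = img.reshape(H // p, p, W // p, p, C)
    img = img.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return img.reshape(-1, p * p * C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
p, d_model = 16, 768
patches = patchify(img, p)                      # (196, 768) for 224x224, p=16
W_embed = rng.normal(scale=0.02, size=(p * p * 3, d_model))
cls_token = rng.normal(scale=0.02, size=(1, d_model))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0] + 1, d_model))

tokens = np.vstack([cls_token, patches @ W_embed]) + pos_embed
print(tokens.shape)  # (197, 768) -- this sequence goes into a BERT-like encoder
```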
Data-efficient image Transformer (DeiT)
The architecture is identical to ViT; the only differences are the training strategy and an extra distillation token.
“Training data-efficient image transformers & distillation through attention”
https://arxiv.org/abs/2012.12877
Bottleneck Transformers (BoTNet)
● A hybrid model: ResNet + Transformer
● Replaces the 3x3 convolutions inside the last three ResNet bottleneck blocks with multi-head self-attention.
● The resulting architecture, BoTNet, scales pretty well.
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
Vision MoE (V-MoE)
● A sparse variant of the recent Vision Transformer (ViT) architecture for image
classification.
● The V-MoE replaces a subset of the dense feed-forward layers in ViT with sparse MoE layers, where each image patch is “routed” to a subset of “experts” (MLPs).
● Scales to model sizes of 15B parameters, the largest vision models to date.
“Scaling Vision with Sparse Mixture of Experts”
https://arxiv.org/abs/2106.05974
Speech and Sound Transformers
There were many transformers for sound as well:
● Speech-Transformer (https://ieeexplore.ieee.org/document/8462506)
● Conformer (https://arxiv.org/abs/2005.08100)
● Transformer-Transducer (https://arxiv.org/abs/1910.12977)
● Transformer-Transducer (https://arxiv.org/abs/2002.02562)
● Conv-Transformer Transducer (https://arxiv.org/abs/2008.05750)
● Speech-XLNet (https://arxiv.org/abs/1910.10387)
● Audio ALBERT (https://arxiv.org/abs/2005.08575)
● Emformer (https://arxiv.org/abs/2010.10759)
● wav2vec 2.0 (https://arxiv.org/abs/2006.11477)
● ...
AST: Audio Spectrogram Transformer
“AST: Audio Spectrogram Transformer”
https://arxiv.org/abs/2104.01778
A convolution-free, purely attention-based
model for audio classification.
Very close to ViT, but AST can process
variable-length audio inputs.
ACT: Audio Captioning Transformer
“Audio Captioning Transformer”
https://arxiv.org/abs/2107.09817
Another convolution-free Transformer
based on an encoder-decoder
architecture.
Multi-channel Transformer for ASR
“End-to-End Multi-Channel Transformer for Speech Recognition”
https://arxiv.org/abs/2102.03951
Transformers in Biology
Finally, transformers have arrived in biology!
● ESM-1b protein language model
(https://www.pnas.org/content/118/15/e2016239118)
● MSA Transformer for multiple sequence alignment
(https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1)
● RoseTTAFold for predicting protein structures (includes graph
transformers)
(https://www.science.org/doi/abs/10.1126/science.abj8754)
● AlphaFold2 for predicting protein structures
(https://www.nature.com/articles/s41586-021-03819-2)
ESM-1b
“Biological structure and function emerge from scaling unsupervised learning to 250 million protein
sequences”, https://www.pnas.org/content/118/15/e2016239118
RoseTTAFold
“Accurate prediction of protein structures and interactions using a 3-track network”
https://www.science.org/doi/abs/10.1126/science.abj8754
AlphaFold 2
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
AlphaFold 2: Evoformer block
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
4. Multi-Modal Transformers
https://arxiv.org/abs/2101.01169
DALL·E (OpenAI)
“Zero-Shot Text-to-Image Generation”
https://arxiv.org/abs/2102.12092
A model trained on paired images and text descriptions.
Autoregressively generates image tokens conditioned on the preceding text and (optionally) image tokens.
Technically, a transformer decoder.
Image tokens are obtained with a pretrained dVAE.
Candidates are ranked using CLIP.
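A very schematic Python sketch of this two-stage idea (my own illustration, not code from the paper): a dVAE maps an image to a grid of discrete codebook indices, and a decoder-only transformer models the concatenated [text tokens, image tokens] sequence. At inference, image tokens are sampled one by one after the text prompt; the vocabulary sizes below are approximate and `next_token_logits` is a stand-in for the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
TEXT_VOCAB, IMAGE_VOCAB, IMAGE_TOKENS = 16384, 8192, 32 * 32  # approximate sizes

def next_token_logits(sequence):
    # Placeholder for the trained transformer decoder: returns logits over
    # the image codebook given all previous (text + image) tokens.
    return rng.normal(size=IMAGE_VOCAB)

def generate_image_tokens(text_tokens):
    seq = list(text_tokens)
    image_tokens = []
    for _ in range(IMAGE_TOKENS):
        logits = next_token_logits(seq)
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        tok = rng.choice(IMAGE_VOCAB, p=probs)   # sample the next image token
        image_tokens.append(tok)
        seq.append(TEXT_VOCAB + tok)             # image tokens live in a shifted joint vocab
    return image_tokens                          # these would be decoded by the dVAE decoder

print(len(generate_image_tokens([1, 2, 3])))     # 1024 image tokens for a 32x32 grid
```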
CLIP (OpenAI)
“Learning Transferable Visual Models From Natural Language Supervision”
https://arxiv.org/abs/2103.00020
Uses contrastive pre-training to predict which caption goes with which image.
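A NumPy sketch of the symmetric contrastive objective on a batch of N matching image/text pairs, close in spirit to the pseudocode in the paper (the embeddings below are random stand-ins for the real image and text encoders):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(image_emb, text_emb, temperature=0.07):
    I = l2_normalize(image_emb)                  # (N, d)
    T = l2_normalize(text_emb)                   # (N, d)
    logits = I @ T.T / temperature               # (N, N) pairwise similarities
    labels = np.arange(len(I))                   # the i-th image matches the i-th caption

    def xent(lg):                                # cross-entropy with the diagonal as targets
        lg = lg - lg.max(axis=-1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=-1, keepdims=True))
        return -logp[np.arange(len(lg)), labels].mean()

    # symmetric: image->text over rows, text->image over columns
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512))   # stand-in image-encoder outputs
txt = rng.normal(size=(8, 512))   # stand-in text-encoder outputs
print(clip_loss(img, txt))
```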
ALIGN (Google)
https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
“Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”
https://arxiv.org/abs/2102.05918
Train EfficientNet-L2 (image encoder) and BERT-large (text encoder) with a
contrastive loss on a huge noisy dataset (1.8B image-text pairs).
CLIPDraw
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”
https://arxiv.org/abs/2106.14843
You can optimize the image to better match a text description (remember
DeepDream?).
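A toy PyTorch sketch of such a CLIP-guided optimization loop (my own illustration): the encoders below are frozen random stand-ins for the real CLIP encoders, and the optimized variable is raw pixels rather than CLIPDraw's actual Bezier-curve parameters fed through a differentiable renderer.

```python
import torch

torch.manual_seed(0)
# Stand-ins for the frozen CLIP image/text encoders (random linear maps).
encode_image = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 512))
encode_text = torch.nn.Linear(77, 512)
for p in list(encode_image.parameters()) + list(encode_text.parameters()):
    p.requires_grad_(False)                      # encoders stay frozen

text_features = encode_text(torch.randn(1, 77)) # stand-in for the encoded prompt
image = torch.rand(1, 3, 64, 64, requires_grad=True)   # the thing we optimize
opt = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    img_features = encode_image(image)
    sim = torch.cosine_similarity(img_features, text_features).mean()
    loss = -sim                                  # maximize image-text similarity
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        image.clamp_(0, 1)                       # keep pixel values valid

print("final similarity:", float(sim))
```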
CLIPDraw
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”
https://arxiv.org/abs/2106.14843
The image is rendered from a set of Bézier curves.
https://twitter.com/RiversHaveWings/status/1410020043178446848
“a beautiful epic wondrous fantasy painting of the ocean”
CLIP + PixelDraw
https://www.reddit.com/r/MediaSynthesis/comments/pf7ru8/set_of_asianthemed_graphics_generated_with_clipit/
Perceiver (Google)
“Perceiver: General Perception with Iterative Attention”
https://arxiv.org/abs/2103.03206
Perceiver IO (Google)
“Perceiver IO: A General Architecture for Structured Inputs & Outputs”
https://arxiv.org/abs/2107.14795
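A minimal NumPy sketch of the core Perceiver trick (my own illustration, sizes arbitrary): a small learned latent array cross-attends to an arbitrarily long input array, so the expensive attention is latent-by-input rather than input-by-input; the latents are then refined with ordinary self-attention.

```python
import numpy as np

def attend(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores); w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n_inputs, n_latents, d = 10_000, 256, 64        # input can be very long, latents stay small
inputs = rng.normal(size=(n_inputs, d))         # e.g. pixels, audio samples, points
latents = rng.normal(size=(n_latents, d))       # learned latent array

latents = attend(latents, inputs, inputs)       # cross-attention: O(n_latents * n_inputs)
latents = attend(latents, latents, latents)     # latent self-attention: O(n_latents^2)
print(latents.shape)                            # (256, 64) regardless of input length
```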
https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!