Transformers and sequence-to-sequence learning
CS 685, Fall 2021
Mohit Iyyer
College of Information and Computer Sciences
University of Massachusetts Amherst
some slides from Emma Strubell
Attention is great
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the source
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
• By inspecting attention distribution, we can see
what the decoder was focusing on
• We get alignment for free!
• This is cool because we never explicitly trained
an alignment system
• The network just learned alignment by itself
Alignments: hardest
[Figure: word alignment grids between the English sentence "The poor don't have any money" and the French sentence "Les pauvres sont démunis"]
From last time
• Project proposals now due 10/1!
• Quiz 2 due Friday
• Can TAs record zoom office hours? Maybe
• How did we get this grid in the previous lecture?
Will explain in today’s class.
• Final proj. reports due Dec. 16th
sequence-to-sequence learning
• we’ll use French (f) to English (e) as a running
example
• goal: given French sentence f with tokens f1, f2, … fn, produce English translation e with tokens e1, e2, … em
• real goal: compute arg max_e p(e | f )
Used when inputs and outputs are
both sequences of words (e.g.,
machine translation, summarization)
This is an instance of
conditional language modeling
p(e | f ) = p(e1, e2, …, em | f )
         = p(e1 | f ) ⋅ p(e2 | e1, f ) ⋅ p(e3 | e2, e1, f ) ⋅ …
         = ∏_{i=1}^{m} p(ei | e1, …, ei−1, f )
Just like we’ve seen before, except we
additionally condition our prediction of
the next word on some other input
(here, the French sentence)
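To make this factorization concrete, here's a minimal sketch of scoring a candidate translation under the chain rule. The `model` interface (take the source and a target prefix, return logits over the next English word) is a hypothetical stand-in:

```python
import torch.nn.functional as F

def conditional_log_prob(model, f_tokens, e_tokens):
    """log p(e | f) = sum_i log p(e_i | e_1..e_{i-1}, f)."""
    total = 0.0
    for i in range(len(e_tokens)):
        logits = model(f_tokens, e_tokens[:i])     # condition on f and the prefix
        log_probs = F.log_softmax(logits, dim=-1)  # normalize over the vocabulary
        total = total + log_probs[e_tokens[i]]     # add log p(e_i | e_<i, f)
    return total
```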
seq2seq models
• use two different neural networks to model p(e | f ) = ∏_{i=1}^{m} p(ei | e1, …, ei−1, f )
• first we have the encoder, which encodes the French sentence f
• then, we have the decoder, which produces the English sentence e
The sequence-to-sequence model: Neural Machine Translation (NMT)
[Figure: The Encoder RNN reads the source sentence (input) "les pauvres sont démunis" and produces an encoding of the source sentence, which provides the initial hidden state for the Decoder RNN. The Decoder RNN is a Language Model that generates the target sentence (output) "the poor don't have any money <END>" conditioned on the encoding, taking an argmax over the vocabulary at each step, starting from <START>.]
Note: This diagram shows test-time behavior: decoder output is fed in as the next step's input.
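A minimal sketch of the test-time loop in this figure (greedy argmax decoding; the `encoder`/`decoder` call signatures are hypothetical stand-ins for the RNNs):

```python
def greedy_decode(encoder, decoder, src_tokens, start_id, end_id, max_len=50):
    """Feed each argmax output back in as the next step's input."""
    hidden = encoder(src_tokens)                 # encoding of the source sentence
    output, prev = [], start_id
    for _ in range(max_len):
        logits, hidden = decoder(prev, hidden)   # one step of the decoder LM
        prev = int(logits.argmax())              # argmax over the vocabulary
        if prev == end_id:                       # stop at <END>
            break
        output.append(prev)
    return output
```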
Training a Neural Machine Translation system
[Figure: The Encoder RNN reads the source sentence (from corpus) "les pauvres sont démunis"; the Decoder RNN is fed the target sentence (from corpus) "<START> the poor don't have any money". At each step t, the loss J_t is the negative log probability of the true next word ("the", …, "have", …, <END>).]
The total loss averages the per-step losses: J = (1/T) ∑_{t=1}^{T} J_t
Seq2seq is optimized as a single system.
Backpropagation operates "end to end".
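A sketch of this objective with teacher forcing (at training time the decoder is fed the true prefix from the corpus, not its own argmax outputs); the `model` interface is hypothetical:

```python
import torch.nn.functional as F

def nmt_loss(model, src, tgt):
    """J = (1/T) sum_t J_t, where J_t = -log p(y_t | y_<t, src)."""
    logits = model(src, tgt[:-1])            # decoder inputs: <START> the ... money
    # cross_entropy averages -log prob over the T steps, i.e. (1/T) sum_t J_t
    return F.cross_entropy(logits, tgt[1:])  # targets: the poor ... <END>
```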
We’ll talk much more about
machine translation / other seq2seq
problems later… but for now, let’s
go back to the Transformer
[Figure: the Transformer architecture of Vaswani et al. 2017, with its stacked encoder and decoder]
So far we've just talked about self-attention… what is all this other stuff?
Self-attention (in encoder)
[Figure sequence: for the input sentence "Nobel committee awards Strickland who advanced optics" at Layer p, each token's representation is projected into query (Q), key (K), and value (V) vectors. Each query is compared against all keys to produce the attention matrix A, and each token's output M is the average of the value vectors weighted by its row of A.]
[Vaswani et al. 2017] Slides by Emma Strubell!
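A minimal sketch of single-head scaled dot-product self-attention (one sentence, no batching or masking), following Vaswani et al. 2017:

```python
import math
import torch

def self_attention(X, Wq, Wk, Wv):
    """X: [n, d] token representations at layer p; Wq/Wk/Wv: [d, d_k] projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values for every token
    scores = Q @ K.T / math.sqrt(K.shape[-1])  # compare each query to all keys
    A = torch.softmax(scores, dim=-1)          # [n, n] attention matrix
    return A @ V                               # M: attention-weighted average of values
```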
Multi-head self-attention
[Figure sequence: the Q/K/V projections and attention-weighted averaging are repeated with H different learned projections, producing heads M1 … MH, which are concatenated. A position-wise Feed Forward layer is then applied to each token's output to produce Layer p+1.]
[Vaswani et al. 2017] Slides by Emma Strubell!
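A sketch extending the single-head version to H heads with the usual reshape trick; `Wo` is the output projection applied to the concatenated heads:

```python
import math
import torch

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, H):
    """X: [n, d]; Wq/Wk/Wv/Wo: [d, d]; H heads, each of size d // H."""
    n, d = X.shape
    def split(M):                                   # [n, d] -> [H, n, d // H]
        return M.view(n, H, d // H).transpose(0, 1)
    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d // H)  # [H, n, n]
    A = torch.softmax(scores, dim=-1)
    M = (A @ V).transpose(0, 1).reshape(n, d)       # concatenate heads M1 ... MH
    return M @ Wo                                   # mix the heads back together
```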
[Figure: the complete encoder stacks many of these blocks — multi-head self-attention + feed forward — from Layer 1 through Layer p up to Layer J, over the input "Nobel committee awards Strickland who advanced optics"]
[Vaswani et al. 2017] Slides by Emma Strubell!
Position embeddings are added to each
word embedding. Otherwise, since we
have no recurrence, our model is unaware
of the position of a word in the sequence!
Residual connections, which mean that we
add the input to a particular block to its
output, help improve gradient flow
A feed-forward layer on top of the attention-
weighted averaged value vectors allows us
to add more parameters / nonlinearity
We stack as many of these
Transformer blocks on top of each
other as we can (bigger models are
generally better given enough data!)
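Putting the last few slides together, a sketch of one encoder block (residual connection around each sub-layer, followed by layer normalization as in Vaswani et al.; position embeddings are assumed to have already been added to the input):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))  # extra parameters / nonlinearity
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                  # x: [batch, seq_len, d_model]
        a, _ = self.attn(x, x, x)          # multi-head self-attention
        x = self.ln1(x + a)                # residual connection helps gradient flow
        return self.ln2(x + self.ff(x))    # residual around the feed-forward too
```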
Moving on to the decoder, which
takes in English sequences that
have been shifted to the right
(e.g., <START> schools opened
their)
We first have an instance of masked self-attention. Since the decoder is responsible for predicting the English words, we need to apply masking as we saw before.
Why don’t we do
masked self-attention
in the encoder?
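(The encoder doesn't need the mask: the source sentence is fully observed rather than predicted, so every token may attend to every other token.) A sketch of the causal mask itself: the upper triangle of the score matrix is set to −∞ before the softmax, so position i can only attend to positions ≤ i:

```python
import torch

n = 4                                    # e.g. "<START> schools opened their"
scores = torch.randn(n, n)               # raw attention scores Q K^T / sqrt(d_k)
future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float('-inf'))  # hide not-yet-generated words
A = torch.softmax(scores, dim=-1)        # row i attends only to tokens <= i
```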
Now, we have cross attention,
which connects the decoder to
the encoder by enabling it to
attend over the encoder’s final
hidden states.
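Cross attention reuses the same machinery; a sketch in which the queries come from decoder states and the keys/values come from the encoder's final hidden states:

```python
import math
import torch

def cross_attention(dec_X, enc_H, Wq, Wk, Wv):
    """dec_X: [m, d] decoder states; enc_H: [n, d] final encoder states."""
    Q = dec_X @ Wq                   # queries from the decoder
    K, V = enc_H @ Wk, enc_H @ Wv    # keys and values from the encoder
    A = torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1)  # [m, n]
    return A @ V                     # each target position attends over the source
```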
After stacking a bunch of
these decoder blocks, we
finally have our familiar
Softmax layer to predict
the next English word
Positional encoding
Creating positional encodings?
• We could just concatenate a fixed value to each time step
(e.g., 1, 2, 3, … 1000) that corresponds to its position,
but then what happens if we get a sequence with 5000
words at test time?
• We want something that can generalize to arbitrary
sequence lengths. We also may want to make attending
to relative positions (e.g., tokens in a local window to the
current token) easier.
• Distance between two positions should be consistent
with variable-length inputs
Intuitive example
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
Transformer positional encoding
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
The positional encoding is a 512-d vector
i = a particular dimension of this vector
pos = position of the word in the sequence
d_model = 512
What does this look like?
[Figure: heatmap of the encodings — each row is the positional embedding of one position in a 50-word sentence]
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
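A sketch that generates these encodings from the formula above (even dimensions use sin, odd dimensions use cos):

```python
import numpy as np

def positional_encoding(max_len=50, d_model=512):
    pos = np.arange(max_len)[:, None]                # position in the sequence
    i = np.arange(d_model // 2)[None, :]             # index of each sin/cos pair
    angles = pos / np.power(10000, 2 * i / d_model)  # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                     # PE(pos, 2i + 1)
    return pe   # row `pos` is added to the word embedding at that position
```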
Despite the intuitive flaws, many models
these days use learned positional
embeddings (i.e., they cannot generalize to
longer sequences, but this isn’t a big deal for
their use cases)
Hacks to make Transformers work
Label smoothing: instead of a one-hot target distribution, keep a little probability mass on every non-target word, so the model gets penalized for overconfidence.

I went to class and took ___

                        cats    TV      notes   took    sofa
one-hot target          0       0       1       0       0
with label smoothing    0.025   0.025   0.9     0.025   0.025

[Figure: loss as a function of target word confidence]
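A sketch of how the smoothed targets in the table are built: ε = 0.1 of the probability mass is spread uniformly over the non-target words:

```python
import torch

def smooth_targets(target_idx, vocab_size, eps=0.1):
    """One-hot -> (1 - eps) on the target, eps / (V - 1) everywhere else."""
    t = torch.full((vocab_size,), eps / (vocab_size - 1))
    t[target_idx] = 1.0 - eps
    return t  # V=5, target "notes" -> [0.025, 0.025, 0.9, 0.025, 0.025]

# the loss is then cross-entropy against the smoothed distribution:
# loss = -(smooth_targets(idx, V) * log_probs).sum()
```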
Why these decisions?
Unsatisfying answer: they empirically worked well.
Neural architecture search finds even better Transformer variants:
Primer: Searching for efficient Transformer architectures… So et al., Sep. 2021
OpenAI’s Transformer LMs
• GPT (Jun 2018): 117 million parameters, trained on
13GB of data (~1 billion tokens)
• GPT2 (Feb 2019): 1.5 billion parameters, trained on
40GB of data
• GPT3 (July 2020): 175 billion parameters, ~500GB
data (300 billion tokens)
Coming up!
• Transfer learning via Transformer models like BERT
• Tokenization (word vs subword vs character/byte)
• Prompt-based learning
• Efficient / long-range Transformers
• Downstream tasks