Insoo Chung
Big ideas in NLP
Simple and effective big ideas that led us here — CSCE181
My name is Insoo Chung
• Graduate student at Texas A&M
• Worked on Machine Translation @ Samsung Research
• On-device translation, simultaneous interpretation,
speech-to-speech translation
• Published at ACL, EMNLP, ICASSP, and WMT
• Interned with Siri @ Apple this Summer
• Performance analysis of on-device Siri
• NLP enthusiast!
Howdy
Big ideas in NLP
What we are covering today
1. Natural language processing — what and why
2. Deep learning — a tiny bit
3. Big ideas
👉 Word2Vec — words
👉 Recurrent neural networks — sentence input
👉 Sequence to sequence framework — sentence output
👉 Attention — long sentence
Natural language processing
Definition
Branch of AI, concerned with giving computers the ability to understand natural language
utterance → text → text → utterance
(usually)
🗣📱
Definition from: https://www.ibm.com/cloud/learn/natural-language-processing
+ generating natural language as well
Language and intelligence
Why?
+ Most knowledge is stored in some form of language → we want computers to exploit it
If computers could process NL, it would be more useful to people.
https://dev.to/spectrumcetb/evolution-of-nlp-f54
Language and intelligence
Turing test
“I propose to consider the question, ‘Can machines think?’”, Turing
https://en.wikipedia.org/wiki/Turing_test
If C cannot determine which player is the human, the machine passes the test
Linguistic ability ≈ Intelligence
Language and intelligence
Artificial general intelligence
The Turing test is one of the tests used for confirming human-level AGI
Linguistic abilities of AI are important in moving towards the next step
But language is so weird!
Slide from Stanford’s CS224n course, original cartoon by Randall Munroe
Then deep learning
happened 💥
Use cases
Commercial products
Machine translation (text → text)
Voice assistants (utterance → utterance)
Source: (left) Google Translate, (right) Apple Siri
So what’s
under the hood?
Deep learning
… for NLP
NLP models can be viewed as a conditional probability function which can be learned using deep learning:
P(y | x, θ)
With x as the input and θ as the model parameters, the model outputs the most probable outcome y.
Deep learning provides an effective way to learn the model parameters θ, given A LOT of data and computation.
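To make the P(y | x, θ) view concrete, here is a minimal sketch (not from the lecture) of a model that, given an input vector x and placeholder parameters θ, outputs a probability distribution over a few possible outcomes y via a softmax; all shapes and weights are made-up toy values.

```python
# A model as a conditional probability function: given x and parameters theta,
# output P(y | x, theta) over a handful of possible outcomes.
import numpy as np

rng = np.random.default_rng(0)
theta = {"W": rng.normal(size=(3, 4)), "b": np.zeros(3)}   # placeholder parameters

def p_y_given_x(x, theta):
    logits = theta["W"] @ x + theta["b"]
    e = np.exp(logits - logits.max())
    return e / e.sum()                                     # softmax: probabilities over y

x = np.array([0.5, -1.0, 2.0, 0.1])                        # some input representation
probs = p_y_given_x(x, theta)
print(probs, probs.argmax())                               # most probable outcome y
```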
Big ideas in NLP
What we are covering today
👉Word2Vec
👉Recurrent neural networks
👉Sequence to sequence framework
👉Attention - a tiny bit
Word2Vec
Language representation in computers
Words are represented as vectors of numbers in NLP. How?
1. Words are associated with random vectors:
brown = [+0.3, −0.4, +0.2, −0.3, ...]ᵀ
fox = [−0.2, −0.1, −0.1, +0.3, ...]ᵀ
2. We go through many sentences and learn θ that predicts the prev/next word probability correctly.
3. The result?
• Word vectors populated in an n-d space that hold semantic/syntactic meaning
Example from: https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture01-wordvecs1.pdf
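A minimal from-scratch sketch of steps 1 and 2 above: every word starts with a random vector, and the vectors plus the parameters are nudged so that each word predicts its previous/next word better. The toy sentences, dimensions, and learning rate are assumptions, and real Word2Vec uses tricks like negative sampling that are omitted here.

```python
# From-scratch sketch: learn word vectors by predicting neighboring words.
import numpy as np

sentences = [["the", "quick", "brown", "fox", "jumps"],
             ["the", "lazy", "brown", "dog", "sleeps"]]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                                 # vocab size, vector dimension

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))            # step 1: random vector per word
W_out = rng.normal(scale=0.1, size=(V, D))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.05
for epoch in range(200):                             # step 2: go through many sentences
    for sent in sentences:
        for t, center in enumerate(sent):
            for o in (t - 1, t + 1):                 # predict the prev/next word
                if 0 <= o < len(sent):
                    c, target = idx[center], idx[sent[o]]
                    p = softmax(W_out @ W_in[c])     # P(neighbor | center)
                    grad = p.copy()
                    grad[target] -= 1.0              # cross-entropy gradient w.r.t. logits
                    grad_in = W_out.T @ grad
                    W_out -= lr * np.outer(grad, W_in[c])
                    W_in[c] -= lr * grad_in

print("brown →", np.round(W_in[idx["brown"]], 2))    # step 3: a learned word vector
```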
Word2Vec
Learned vectors
Semantically close words are near each other
Syntactic relationships are preserved with relative positioning
e.g. slower − slow ≈ faster − fast
We have computable representations for words!
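The analogy property can be checked with pretrained vectors, for example along these lines (a sketch assuming the gensim library is installed and its small GloVe download succeeds; the exact neighbors returned may differ):

```python
# slower - slow + fast ≈ faster, computed with pretrained GloVe vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")        # downloads the vectors on first use
print(wv.most_similar(positive=["slower", "fast"], negative=["slow"], topn=3))
```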
Recurrent neural networks
Dealing with sequence inputs
We now know how to deal with words using adjacency stats, but how do we handle sentences?
→ Consider movie review sentiment analysis.
(Example: a review screenshot labeled “Negative”)
Recurrent neural networks
Dealing with sequence inputs
Very good → 1: positive
I enjoyed this as much as my cat enjoys baths → 0: negative
f(y | x, θ)
How do we deal with a sentence, i.e. a sequence of words?
→ If we consider every possible sentence, the possible # of inputs would be ∞ - intractable
Recurrent neural networks
Dealing with sequence inputs
[“Very”, “good”] → 1: positive
[“I”, “enjoyed”, “this”, “as”, “much”, “as”, “my”, “cat”, “enjoys”, “baths”] → 0: negative
f(y | x, θ)
How do we deal with a sentence, i.e. a sequence of words?
→ Break it down to the word level: then the possible # of words wouldn't be that many (~30K) - tractable!
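A rough sketch of the "break it down to the word level" step, assuming a plain whitespace tokenizer and a tiny stand-in vocabulary (real systems use a learned vocabulary of roughly 30K entries):

```python
# Map a review to a list of word IDs so a model can consume it.
# The vocabulary here is a tiny stand-in for a real ~30K-word vocabulary.
vocab = {"<unk>": 0, "very": 1, "good": 2, "i": 3, "enjoyed": 4, "this": 5}

def tokenize(review: str) -> list[int]:
    """Lowercase, split on whitespace, and look up each word's ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in review.lower().split()]

print(tokenize("Very good"))                      # [1, 2]
print(tokenize("I enjoyed this as much as ..."))  # unknown words map to 0
```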
Recurrent neural networks
Dealing with sequence inputs
Recurrent neural network
1. Handle words step-by-step.
2. Use the previous state vector and the word vector to create the next state vector
3. Use the final step's output to determine the result
P(h<t+1> | x<t>, h<t>, θ)
(Diagram: context h<0> → f(“Very”) → h<1> → f(“good”) → h<2> → 1: positive)
ŷ = P(sentiment | review)
  = h<2>
  = f(x<2>, h<1>)
  = f(x<2>, f(x<1>, h<0>))
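A minimal sketch of the recurrence above: one state update per word, with the final state turned into P(positive | review). The embedding table, dimensions, and weights are hypothetical and untrained, so the output is arbitrary; it only illustrates the step-by-step flow.

```python
# Minimal RNN sentiment sketch (illustrative, untrained weights):
# one state update per word, final state decides positive/negative.
import numpy as np

D_WORD, D_STATE = 8, 16                       # assumed toy dimensions
rng = np.random.default_rng(0)

vocab = {"very": 0, "good": 1}                # assumed toy vocabulary
E = rng.normal(scale=0.1, size=(len(vocab), D_WORD))      # word vectors
W_h = rng.normal(scale=0.1, size=(D_STATE, D_STATE))      # state -> state
W_x = rng.normal(scale=0.1, size=(D_STATE, D_WORD))       # word  -> state
w_out = rng.normal(scale=0.1, size=D_STATE)               # state -> score

def step(h, x):
    """h<t+1> = f(x<t>, h<t>): combine the previous state and the word vector."""
    return np.tanh(W_h @ h + W_x @ x)

def classify(words):
    h = np.zeros(D_STATE)                     # h<0>
    for w in words:                           # step 1: word by word
        h = step(h, E[vocab[w]])              # step 2: update the state
    return 1 / (1 + np.exp(-w_out @ h))       # step 3: P(positive | review)

print(classify(["very", "good"]))             # some value in (0, 1); weights are untrained
```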
Recurrent neural networks
Use case
We can read ∞-many possible sentences!
But how can we produce sequence outputs?
Seq2seq
Producing sequence outputs
Same principle: produce one word at a time
Words are read by the encoder RNN and its state is updated
The final state of the encoder is fed as the initial state of the decoder
The decoder RNN does its thing: it emits one output word at a time depending on the state, and the state is also updated for the next step
Terminates when a special token is emitted
We can generate ∞-many possible sentences!
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
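A sketch of the encoder-decoder loop described above, with hypothetical toy vocabularies and untrained weights: the encoder reads the source words and updates its state, the final state seeds the decoder, and the decoder emits one word per step until a special end token appears.

```python
# Minimal seq2seq decoding sketch (illustrative; random untrained weights).
import numpy as np

D = 16
rng = np.random.default_rng(0)
src_vocab = {"very": 0, "good": 1}                 # assumed toy vocabularies
tgt_vocab = ["<end>", "muy", "bueno"]

E_src = rng.normal(scale=0.1, size=(len(src_vocab), D))
E_tgt = rng.normal(scale=0.1, size=(len(tgt_vocab), D))
W_enc = rng.normal(scale=0.1, size=(D, 2 * D))     # encoder RNN parameters
W_dec = rng.normal(scale=0.1, size=(D, 2 * D))     # decoder RNN parameters
W_out = rng.normal(scale=0.1, size=(len(tgt_vocab), D))

def rnn_step(W, state, x):
    """One RNN step: combine the previous state and the current input vector."""
    return np.tanh(W @ np.concatenate([state, x]))

def translate(words, max_len=10):
    state = np.zeros(D)
    for w in words:                                # encoder reads words, state is updated
        state = rnn_step(W_enc, state, E_src[src_vocab[w]])
    output, prev = [], E_tgt[tgt_vocab.index("<end>")]   # decoder starts from the final encoder state
    for _ in range(max_len):
        state = rnn_step(W_dec, state, prev)       # decoder state update
        w_id = int(np.argmax(W_out @ state))       # emit the most probable word
        if tgt_vocab[w_id] == "<end>":             # stop on the special token
            break
        output.append(tgt_vocab[w_id])
        prev = E_tgt[w_id]
    return output

print(translate(["very", "good"]))                 # untrained, so the output is arbitrary
```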
Attention
Handling long dependencies
Words can be too far away!
P(h'<t> | y<t−1>, h'<t−1>, θ)
h'<t> has to account for every x<i> ∈ X and every y<i> with i < t
- Questions and answers can easily have 20 words
- Relying on a single context vector to account for the context of 30+ words is not optimal
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
Attention
Handling long dependencies
Tailor-made context vector for each step!
👈 no details today
Image source: https://www.tensorflow.org/text/tutorials/nmt_with_attention
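A sketch of the per-step context vector, using simple dot-product scoring (one of several possible scoring functions; the names and toy sizes here are assumptions): each decoder step weighs all encoder states and takes their weighted sum as its own context.

```python
# Minimal attention sketch (dot-product scoring; illustrative only).
# Instead of one fixed context vector, each decoder step builds its own
# context as a weighted sum of all encoder states.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Tailor-made context for one decoder step.

    decoder_state:  (D,)   current decoder state h'<t>
    encoder_states: (T, D) one state per source word
    """
    scores = encoder_states @ decoder_state        # how relevant is each source word?
    weights = softmax(scores)                      # attention weights, sum to 1
    return weights @ encoder_states, weights       # weighted sum = context vector

# toy usage with random "encoder states" for a 5-word source sentence
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 16))
dec = rng.normal(size=16)
context, w = attention_context(dec, enc)
print(w.round(2), context.shape)                   # weights over 5 words, (16,) context
```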
Transformer
What empowered the seq2seq framework further
The attention-based Transformer model greatly improved NLP performance
👉Parallel encoding is possible (decoding is still auto-regressive)
👉SOTA performance in a multitude of tasks
👉Performance scales indefinitely with the size of data and the number of parameters
State of the art
GPT-3
Language models are flexible task solvers!
Source: https://beta.openai.com/examples
State of the art
Codex (GPT-3 descendant)
Image source: GitHub Copilot
Use cases
https://ai.googleblog.com/2019/05/introducing-translatotron-end-to-end.html
Audio = array of spectrogram patches
No textual representation in between audio in/out
Different modalities
The sequence view from NLP is useful for other modalities too!
Use cases
Different modalities
https://arxiv.org/pdf/2010.11929v2.pdf
Image = array of image patches
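A small sketch of the "image = array of patches" idea: reshape an image into a sequence of fixed-size patches so the same sequence machinery can read it. The 16×16 patch size follows the ViT paper linked above; the array here is an all-zeros placeholder, not a real image.

```python
# Turn an image into a sequence of patch "tokens" (ViT-style patchification).
import numpy as np

img = np.zeros((224, 224, 3))                   # H x W x channels, placeholder data
P = 16                                          # patch size
patches = (img.reshape(224 // P, P, 224 // P, P, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * 3))          # (196, 768): a "sentence" of 196 patches
print(patches.shape)
```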
Use cases
https://arxiv.org/pdf/2102.00719.pdf
Video = array of images
Different modalities
State of the art
DALL-E 2
Image source: https://openai.com/dall-e-2/
NLP model as encoder, generative model as decoder
State of the art
Midjourney examples
@midjourneyartwork
State of the art
Midjourney examples
@midjourney.architecture
Recap
Big ideas in NLP
A. Language is important, but hard to compute
- Context, nuances, ∞-many possible sentences
B. Word2vec creates a means to map words to meaning vectors
- Allows computational representation
C. RNNs can read sentences at the word level
- Compute friendly
D. Seq2seq provides a way to generate sentences
- More flexibility
E. Attention lets you handle long sentences
Simple and effective ideas changed the game
CEO of OpenAI
Implications?
John Carmack on Lex Fridman’s podcast: https://youtu.be/xLi83prR5fg
He also said (something along these lines):
“The remaining ideas are simple enough to be written down on the back of an envelope”
“The code for AGI will be ~10,000 lines of code and will take one man to implement it”
The father of FPS
Where to learn more?
A. Stanford CS224n: Natural Language Processing - lecture
B. The Unreasonable Effectiveness of Recurrent Neural Networks - article
C. The Illustrated Transformer - article
D. Speech and Language Processing - book
E. MIT 6.S191: Introduction to Deep Learning - lecture
Questions?
