Introductory seminar on NLP for CS sophomores. Presented to Texas A&M's Fall 2022 CSCE181 class. Slides are a bit redundant due to compatibility issues :\
CSCE181 Big ideas in NLP
1. Insoo Chung
Big ideas in NLP
Simple and effective big ideas that led us here — CSCE181
2. My name is Insoo Chung
• Graduate student at Texas A&M
• Worked on Machine Translation @ Samsung Research
• On-device translation, simultaneous interpretation,
speech-to-speech translation
• Published at ACL, EMNLP, ICASSP, and WMT
• Interned with Siri @ Apple this Summer
• Performance analysis of on-device Siri
• NLP enthusiast!
Howdy
3. Big ideas in NLP
What we are covering today
1. Natural language processing — what and why
2. Deep learning — a tiny bit
3. Big ideas
👉 Word2Vec — words
👉 Recurrent neural networks — sentence input
👉 Sequence to sequence framework — sentence output
👉 Attention — long sentence
4. Natural language processing
Definition
Branch of AI, concerned with giving computers the ability to understand natural language
utterance → text → text → utterance
(usually)
🗣📱
Definition from: https://www.ibm.com/cloud/learn/natural-language-processing
+ generating natural language as well
5. Language and intelligence
Why?
+ Most knowledge is stored in some form of language → we want computers to exploit it
If computers could process NL, they would be more useful to people.
6. Language and intelligence
Why?
+ Most knowledge is stored in some form of language → we want computers to exploit it
If computers could process NL, they would be more useful to people.
https://dev.to/spectrumcetb/evolution-of-nlp-f54
8. Language and intelligence
Turing test
“I propose to consider the question, ‘Can machines think?’”, Turing
https://en.wikipedia.org/wiki/Turing_test
If C cannot determine which player is the human, the machine passes the test
Linguistic ability ≈ Intelligence
13. Language and intelligence
Artificial general intelligence
The Turing test is one of the tests used for confirming human-level AGI
The linguistic ability of an AI is important in moving toward the next step
14. Language and intelligence
Artificial general intelligence
The Turing test is one of the tests used for confirming human-level AGI
The linguistic ability of an AI is important in moving toward the next step
But language is so weird!
16. Slide from Stanford’s CS224n course, original cartoon by Randall Munroe
Then deep learning
happened 💥
17. Use cases
Commercial products
Machine translation (text → text)
Voice assistants (utterance → utterance)
Source: (left) Google translate, (right) Apple Siri
18. Use cases
Commercial products
Machine translation (text → text)
Voice assistants (utterance → utterance)
Source: (left) Google translate, (right) Apple Siri
So what’s
under the hood?
19. Deep learning
… for NLP
NLP models can be viewed as a conditional probability function which can be learned using deep
learning.
f(y | x, θ)
20. NLP models can be viewed as a conditional probability function which can be learned using deep
learning.
P(y | x, θ)
With x as input and θ as model params, f outputs the most probable outcome y.
Deep learning
… for NLP
21. NLP models can be viewed as a conditional probability function which can be learned using deep
learning.
Deep learning provides an effective way to learn the model θ, given A LOT of data and computation.
P(y | x, θ)
With x as input and θ as model params, f outputs the most probable outcome y.
Deep learning
… for NLP
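To make the P(y | x, θ) view concrete, here is a minimal sketch (not from the slides; the data, dimensions, and learning rate are made up) of learning θ with gradient descent for a toy binary classifier:

import numpy as np

# Toy data: four inputs x as 2-d feature vectors, with binary labels y.
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])

theta = np.zeros(2)                        # model parameters θ

def p_y_given_x(X, theta):
    # P(y = 1 | x, θ) modeled as a sigmoid over a linear score
    return 1.0 / (1.0 + np.exp(-X @ theta))

for step in range(500):                    # "learning" = nudging θ to fit the data
    p = p_y_given_x(X, theta)
    grad = X.T @ (p - y) / len(y)          # gradient of the cross-entropy loss
    theta -= 0.5 * grad                    # gradient descent update

print(p_y_given_x(X, theta))               # probabilities now close to [1, 0, 1, 0]

Real NLP models use the same recipe, only with far more parameters, far more data, and a neural network in place of the linear score.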
22. Big ideas in NLP
What we are covering today
👉Word2Vec
👉Recurrent neural networks
👉Sequence to sequence framework
👉Attention - a tiny bit
23. Word2Vec
Language representation in computers
Words are represented as vectors of numbers in NLP. How?
1. Words are associated with random vectors:
2. We go through many sentences and learn θ that predicts prev/next word probabilities correctly.
Example from: https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture01-wordvecs1.pdf
brown = [+0.3, −0.4, +0.2, −0.3, ...]ᵀ
fox = [−0.2, −0.1, −0.1, +0.3, ...]ᵀ
24. Word2Vec
Example from: https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture01-wordvecs1.pdf
Words are represented as vectors of numbers in NLP. How?
1. Words are associated with random vectors:
2. We go through many sentences and learn θ that predicts prev/next word probabilities correctly.
3. The result?
• Word vectors populate an n-dimensional space that captures semantic/syntactic meaning
brown = [+0.3, −0.4, +0.2, −0.3, ...]ᵀ
fox = [−0.2, −0.1, −0.1, +0.3, ...]ᵀ
Language representation in computers
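As a hedged illustration (not the exact setup from the slides; the tiny corpus and hyperparameters are made up), the same idea can be tried with the gensim library, which updates initially random word vectors so they predict surrounding words:

from gensim.models import Word2Vec

# A tiny made-up corpus; real training uses millions of sentences.
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "brown", "dog", "sleeps"],
    ["a", "quick", "fox", "runs"],
]

# Skip-gram training: each word's vector is nudged to predict its neighbors.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["brown"][:4])           # a learned vector like [+0.3, -0.4, ...]
print(model.wv.most_similar("fox"))    # nearby words in the learned space

With a corpus this small the neighbors are noise; the point is only the mechanics of step 1 and step 2 above.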
26. Word2Vec
Learned vectors
Semantically close words are near each other
Syntactic relationships are preserved with relative positioning
e.g. slower − slow ≈ faster − fast (as word vectors)
28. Word2Vec
Learned vectors
Semantically close words are near each other
We have computable representations for words!
Syntactic relationships are preserved with relative positioning
e.g. slower − slow ≈ faster − fast (as word vectors)
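A quick way to see this analogy structure for yourself, as a sketch (the pretrained GloVe vectors and the exact neighbors returned are assumptions, not from the slides; the first call downloads ~66 MB):

import gensim.downloader

# Small pretrained word vectors.
wv = gensim.downloader.load("glove-wiki-gigaword-50")

# slower - slow + fast should land near "faster" (vector arithmetic on meanings).
print(wv.most_similar(positive=["slower", "fast"], negative=["slow"], topn=3))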
29. Recurrent neural networks
Dealing with sequence inputs
We now know how to deal with words using adjacency stats, but how do we handle sentences?
→ Consider movie review sentiment analysis.
30. Recurrent neural networks
Dealing with sequence inputs
We now know how to deal with words using adjacency stats, but how do we handle sentences?
→ Consider movie review sentiment analysis.
Negative
31. How do we deal with a sentence, i.e. a sequence of words?
→ If we consider every possible sentence, the possible # of inputs would be ∞ - intractable
Recurrent neural networks
Dealing with sequence inputs
Very good 1: positive
I enjoyed this as much as my cat enjoys baths 0: negative
f(y | x, θ)
32. How do we deal with a sentence, i.e. a sequence of words?
→ Break it down to the word level: then the possible # of words wouldn’t be that many (~30K) - tractable!
Recurrent neural networks
Dealing with sequence inputs
[“Very”, “good”] 1: positive
[“I”, “enjoyed”, “this”, “as”, “much”, “as”, “my”,
“cat”, “enjoys”, “baths”]
0: negative
f(y | x, θ)
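As a minimal sketch (the vocabulary and IDs below are made up; real systems use ~30K entries), breaking a review into a tractable sequence of word IDs might look like this:

# A tiny made-up vocabulary mapping words to integer IDs.
vocab = {"<unk>": 0, "very": 1, "good": 2, "i": 3, "enjoyed": 4, "this": 5}

def tokenize(review):
    # Lowercase, split on whitespace, map each word to its ID (0 if unknown).
    return [vocab.get(word, vocab["<unk>"]) for word in review.lower().split()]

print(tokenize("Very good"))              # [1, 2]
print(tokenize("I enjoyed this movie"))   # [3, 4, 5, 0]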
33. Recurrent neural network
1. Handle words step-by-step.
2. Use the previous state vector and the word vector to create the next state vector
3. Use the final step’s output to determine the result
P(h<t+1> | x<t>, h<t>, θ)
Recurrent neural networks
Dealing with sequence inputs
34. Recurrent neural network
1. Handle words step-by-step.
2. Use the previous state vector and the word vector to create the next state vector
3. Use the final step’s output to determine the result
P(h<t+1> | x<t>, h<t>, θ)
Recurrent neural networks
Dealing with sequence inputs
[Diagram labels: Context + Word → New context]
35. Recurrent neural network
1. Handle words step-by-step.
2. Use the previous state vector and the word vector to create the next state vector
3. Use the final step’s output to determine the result
Recurrent neural networks
Dealing with sequence inputs
P(h<t+1> | x<t>, h<t>, θ)
[Diagram: h<0> + “Very” → f → h<1>; h<1> + “good” → f → h<2> → 1: positive]
36. Recurrent neural network
1. Handle words step-by-step.
2. Use the previous state vector and the word vector to create the next state vector
3. Use the final step’s output to determine the result
ŷ(s) = P(sentiment | review)
     = h<2>
     = f(x<2>, h<1>)
     = f(x<2>, f(x<1>, h<0>))
Recurrent neural networks
Dealing with sequence inputs
[Diagram: h<0> + “Very” → f → h<1>; h<1> + “good” → f → h<2> → 1: positive]
P(h<t+1> | x<t>, h<t>, θ)
37. Recurrent neural network
[Diagram: h<0> + “Very” → f → h<1>; h<1> + “good” → f → h<2> → 1: positive, with “Step 1” labels]
P(h<t+1> | x<t>, h<t>, θ)
1. Handle words step-by-step.
2. Use the previous state vector and the word vector to create the next state vector
3. Use the final step’s output to determine the result
Recurrent neural networks
Dealing with sequence inputs
ŷ(s) = P(sentiment | review)
     = h<2>
     = f(x<2>, h<1>)
     = f(x<2>, f(x<1>, h<0>))
38. Recurrent neural network
[Diagram: h<0> + “Very” → f → h<1>; h<1> + “good” → f → h<2> → 1: positive, with “Step 1” and “Step 2” labels]
P(h<t+1> | x<t>, h<t>, θ)
1. Handle words step-by-step.
2. Use the previous state vector and the word vector to create the next state vector
3. Use the final step’s output to determine the result
Recurrent neural networks
Dealing with sequence inputs
ŷ(s) = P(sentiment | review)
     = h<2>
     = f(x<2>, h<1>)
     = f(x<2>, f(x<1>, h<0>))
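A minimal numpy sketch of this unrolling (the dimensions, weights, and word vectors are all made up here; a real RNN learns them from labeled reviews):

import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters θ and word vectors (dimension 4 for illustration).
W_x = rng.normal(size=(4, 4))       # transforms the current word vector x<t>
W_h = rng.normal(size=(4, 4))       # transforms the previous state h<t>
w_out = rng.normal(size=4)          # maps the final state to a sentiment score
word_vecs = {"very": rng.normal(size=4), "good": rng.normal(size=4)}

def rnn_step(x_t, h_t):
    # h<t+1> = f(x<t>, h<t>): combine the word and the context into a new context.
    return np.tanh(W_x @ x_t + W_h @ h_t)

h = np.zeros(4)                     # h<0>: empty context
for word in ["very", "good"]:       # step 1, step 2, ...
    h = rnn_step(word_vecs[word], h)

p_positive = 1 / (1 + np.exp(-w_out @ h))   # P(sentiment = positive | review)
print(p_positive)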
41. But how can we produce sequence outputs?
Seq2seq
Producing sequence outputs
42. Same principle: produce one word at a time
Seq2seq
Producing sequence outputs
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
43. Same principle: produce one word at a time
Seq2seq
Producing sequence outputs
Words are read by the encoder RNN and its state is updated
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
44. Same principle: produce one word at a time
Seq2seq
Producing sequence outputs
The final state of the encoder is fed in as the initial state of the decoder
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
45. Same principle: produce one word at a time
Seq2seq
Producing sequence outputs
Decoder RNN does its thing:
Emits an output word one at a time, depending on the state
The state is also updated for the next step.
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
46. Same principle: produce one word at a time
Seq2seq
Producing sequence outputs
Terminates when a special end-of-sequence token is emitted
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
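A minimal sketch of that decoding loop (everything here, including the vocabulary, weights, encode(), and decoder_step(), is a made-up stand-in, not a real translation model or library API):

import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "hola", "mundo"]              # made-up target-side vocabulary

# Made-up parameters and source word vectors; a real model learns these
# from many translated sentence pairs.
W_h = rng.normal(size=(4, 4))
W_out = rng.normal(size=(len(vocab), 4))
src_vecs = {"hello": rng.normal(size=4), "world": rng.normal(size=4)}

def encode(source_words):
    # Encoder RNN: read the source words and fold them into one final state.
    state = np.zeros(4)
    for word in source_words:
        state = np.tanh(W_h @ state + src_vecs[word])
    return state

def decoder_step(state):
    # Decoder RNN step: update the state, then score every vocabulary word.
    state = np.tanh(W_h @ state)
    scores = W_out @ state
    return state, int(np.argmax(scores))        # greedy: pick the best-scoring word

state = encode(["hello", "world"])              # encoder's final state seeds the decoder
output = []
for _ in range(10):                             # emit one word at a time
    state, word_id = decoder_step(state)
    if vocab[word_id] == "<eos>":               # stop on the special end token
        break
    output.append(vocab[word_id])
print(output)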
49. Words can be too far away!
Attention
Handling long dependency
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
50. Attention
Handling long dependency
P(h′<t> | y<t−1>, h′<t−1>, θ)
Words can be too far away!
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
51. Words can be too far away!
Attention
Handling long dependency
P(h′<t> | y<t−1>, h′<t−1>, θ)
h′<t> accounts for all x<i> ∈ X and all previous y<i> (i < t)
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
52. Words can be too far away!
Attention
Handling long dependency
P(h′<t> | y<t−1>, h′<t−1>, θ)
h′<t> accounts for all x<i> ∈ X and all previous y<i> (i < t)
- Questions and answers can easily have 20 words
- Relying on a single context vector to account for the context of 30+ words is not optimal
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
55. Attention
Handling long dependency
Tailor-made context vector for each step!
👈 no details today
Image source: https://www.tensorflow.org/text/tutorials/nmt_with_attention
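The slide skips the details, but the core idea can be sketched in a few lines (a minimal dot-product attention with made-up encoder states; not the exact variant from the tutorial linked above):

import numpy as np

rng = np.random.default_rng(0)

# Made-up encoder states, one vector per source word (5 words, dimension 4).
encoder_states = rng.normal(size=(5, 4))
decoder_state = rng.normal(size=4)               # current decoder state h'<t>

# Score every source position against the current decoder state.
scores = encoder_states @ decoder_state          # one number per source word
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: attention weights

# Tailor-made context vector for this step: a weighted mix of encoder states.
context = weights @ encoder_states
print(weights.round(2), context.round(2))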
56. The attention-based Transformer model greatly improved NLP performance
👉Parallel encoding is possible (decoding is still auto-regressive)
👉SOTA performance on a multitude of tasks
👉Performance keeps scaling with the amount of data and the number of parameters
Transformer
What empowered seq2seq framework further
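To illustrate the parallel-encoding point, here is a minimal single-head self-attention sketch with made-up numbers (not the full Transformer): every position is encoded in one matrix product instead of an RNN's step-by-step loop.

import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(6, 8))             # 6 word vectors, processed all at once
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v     # queries, keys, values for every position
scores = Q @ K.T / np.sqrt(8)           # every word attends to every word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
encoded = weights @ V                   # all 6 positions encoded in parallel

print(encoded.shape)                    # (6, 8): no sequential loop needed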
58. State of the art
GPT-3
Language models are flexible task solvers!
Source: https://beta.openai.com/examples
59. State of the art
GPT-3
Source: https://beta.openai.com/examples
60. State of the art
Codex (GPT-3 descendant)
Image source: GitHub Copilot
64. State of the art
DALL-E 2
Image source: https://openai.com/dall-e-2/
NLP model as encoder, generative model as decoder
66. State of the art
Midjourney examples
@midjourneyartwork
67. State of the art
Midjourney examples
@midjourney.architecture
68. Recap
Big ideas in NLP
A. Language is important, but hard to compute
A. Context, nuances, ∞-many possible sentences
B. Word2vec creates a means to map words to meaning vectors
A. Allows computational representation
C. RNNs can read sentences at the word level
A. Compute friendly
D. Seq2seq provides a way to generate sentences
A. More flexibility
E. Attention lets you handle long sentences.
70. Recap
Big ideas in NLP
A. Language is important, but hard to compute
A. Context, nuances, ∞-many possible sentences
B. Word2vec creates a means to map words to meaning vectors
A. Allows computational representation
C. RNNs can read sentences at the word level
A. Compute friendly
D. Seq2seq provides a way to generate sentences
A. More flexibility
E. Attention lets you handle long sentences
Simple and effective ideas changed the game
74. John Carmack on Lex Fridman’s podcast: https://youtu.be/xLi83prR5fg
He also said (something along these lines):
“The remaining ideas are simple enough to be written down on the back of an envelope”
“The code for AGI will be ~10,000 lines of code that one person could implement”
The father of FPS
75. Where to learn more?
A. Stanford CS224n: Natural Language Processing - lecture
B. The Unreasonable Effectiveness of Recurrent Neural Networks - article
C. The Illustrated Transformer - article
D. Speech and Language Processing - book
E. MIT 6.S191: Introduction to Deep Learning - lecture