Insoo Chung
Big ideas in NLP
Simple and effective big ideas that led us here — CSCE181
My name is Insoo Chung
• Graduate student at Texas A&M
• Worked on Machine Translation @ Samsung Research
• On-device translation, simultaneous interpretation,
speech-to-speech translation
• Published at ACL, EMNLP, ICASSP, and WMT
• Interned with Siri @ Apple this Summer
• Performance analysis of on-device Siri
• NLP enthusiast!
Howdy
Big ideas in NLP
What we are covering today
1. Natural language processing — what and why
2. Deep learning — a tiny bit
3. Big ideas
👉 Word2Vec — words
👉 Recurrent neural networks — sentence input
👉 Sequence to sequence framework — sentence output
👉 Attention — long sentence
Natural language processing
Definition
Branch of AI, concerned with giving computers the ability to understand natural language
utterance → text → text → utterance
(usually)
🗣📱
Definition from: https://www.ibm.com/cloud/learn/natural-language-processing
+ generating natural language as well
Language and intelligence
Why?
+ Most knowledge is stored in some form of language → we want computers to exploit it
If computers could process NL, it would be more useful to people.
https://dev.to/spectrumcetb/evolution-of-nlp-f54
Language and intelligence
Turing test
“I propose to consider the question, ‘Can machines think?’”, Turing
https://en.wikipedia.org/wiki/Turing_test
If C cannot determine which player is the human, the machine passes the test
Linguistic ability ≈ Intelligence
Language and intelligence
Artificial general intelligence
The Turing test is one of the tests used for confirming human-level AGI
Linguistic abilities of AI are important in moving towards the next step
But language is so weird!
Slide from Stanford’s CS224n course, original cartoon by Randall Munroe
Then deep learning
happened 💥
Use cases
Commercial products
Machine translation (text → text)
Voice assistants (utterance → utterance)
Source: (left) Google Translate, (right) Apple Siri
So what’s
under the hood?
Deep learning
… for NLP
NLP models can be viewed as a conditional probability function which can be learned using deep learning:
P(y | x, θ)
With x as the input and θ as the model parameters, the model outputs the most probable outcome y.
Deep learning provides an effective way to learn the model parameters θ, given A LOT of data and computation.
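To make the P(y | x, θ) view concrete, here is a minimal sketch (not from the lecture) of a model that, given an input vector x and placeholder parameters θ, outputs a probability distribution over a few possible outcomes y via a softmax; all shapes and weights are made-up toy values.

```python
# A model as a conditional probability function: given x and parameters theta,
# output P(y | x, theta) over a handful of possible outcomes.
import numpy as np

rng = np.random.default_rng(0)
theta = {"W": rng.normal(size=(3, 4)), "b": np.zeros(3)}   # placeholder parameters

def p_y_given_x(x, theta):
    logits = theta["W"] @ x + theta["b"]
    e = np.exp(logits - logits.max())
    return e / e.sum()                                     # softmax: probabilities over y

x = np.array([0.5, -1.0, 2.0, 0.1])                        # some input representation
probs = p_y_given_x(x, theta)
print(probs, probs.argmax())                               # most probable outcome y
```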
Big ideas in NLP
What we are covering today
👉Word2Vec
👉Recurrent neural networks
👉Sequence to sequence framework
👉Attention - a tiny bit
Word2Vec
Language representation in computers
Words are represented as vectors of numbers in NLP. How?
1. Words are associated with random vectors:
brown = [+0.3, −0.4, +0.2, −0.3, ...]ᵀ
fox = [−0.2, −0.1, −0.1, +0.3, ...]ᵀ
2. We go through many sentences and learn θ that predicts the prev/next word probability correctly.
3. The result?
• Word vectors populated in an n-d space that hold semantic/syntactic meaning
Example from: https://web.stanford.edu/class/cs224n/slides/cs224n-2022-lecture01-wordvecs1.pdf
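A minimal from-scratch sketch of steps 1 and 2 above: every word starts with a random vector, and the vectors plus the parameters are nudged so that each word predicts its previous/next word better. The toy sentences, dimensions, and learning rate are assumptions, and real Word2Vec uses tricks like negative sampling that are omitted here.

```python
# From-scratch sketch: learn word vectors by predicting neighboring words.
import numpy as np

sentences = [["the", "quick", "brown", "fox", "jumps"],
             ["the", "lazy", "brown", "dog", "sleeps"]]
vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                                 # vocab size, vector dimension

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))            # step 1: random vector per word
W_out = rng.normal(scale=0.1, size=(V, D))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.05
for epoch in range(200):                             # step 2: go through many sentences
    for sent in sentences:
        for t, center in enumerate(sent):
            for o in (t - 1, t + 1):                 # predict the prev/next word
                if 0 <= o < len(sent):
                    c, target = idx[center], idx[sent[o]]
                    p = softmax(W_out @ W_in[c])     # P(neighbor | center)
                    grad = p.copy()
                    grad[target] -= 1.0              # cross-entropy gradient w.r.t. logits
                    grad_in = W_out.T @ grad
                    W_out -= lr * np.outer(grad, W_in[c])
                    W_in[c] -= lr * grad_in

print("brown →", np.round(W_in[idx["brown"]], 2))    # step 3: a learned word vector
```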
Word2Vec
Learned vectors
Semantically close words are near each other
Syntactic relationships are preserved with relative positioning
e.g. slower − slow ≈ faster − fast
We have computable representations for words!
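The analogy property can be checked with pretrained vectors, for example along these lines (a sketch assuming the gensim library is installed and its small GloVe download succeeds; the exact neighbors returned may differ):

```python
# slower - slow + fast ≈ faster, computed with pretrained GloVe vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")        # downloads the vectors on first use
print(wv.most_similar(positive=["slower", "fast"], negative=["slow"], topn=3))
```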
Recurrent neural networks
Dealing with sequence inputs
We now know how to deal with words using adjacency stats, but how do we handle sentences?
→ Consider movie review sentiment analysis.
(Example: a review screenshot labeled “Negative”)
Recurrent neural networks
Dealing with sequence inputs
Very good → 1: positive
I enjoyed this as much as my cat enjoys baths → 0: negative
f(y | x, θ)
How do we deal with a sentence, i.e. a sequence of words?
→ If we consider every possible sentence, the possible # of inputs would be ∞ - intractable
Recurrent neural networks
Dealing with sequence inputs
[“Very”, “good”] → 1: positive
[“I”, “enjoyed”, “this”, “as”, “much”, “as”, “my”, “cat”, “enjoys”, “baths”] → 0: negative
f(y | x, θ)
How do we deal with a sentence, i.e. a sequence of words?
→ Break it down to the word level: then the possible # of words wouldn't be that many (~30K) - tractable!
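A rough sketch of the "break it down to the word level" step, assuming a plain whitespace tokenizer and a tiny stand-in vocabulary (real systems use a learned vocabulary of roughly 30K entries):

```python
# Map a review to a list of word IDs so a model can consume it.
# The vocabulary here is a tiny stand-in for a real ~30K-word vocabulary.
vocab = {"<unk>": 0, "very": 1, "good": 2, "i": 3, "enjoyed": 4, "this": 5}

def tokenize(review: str) -> list[int]:
    """Lowercase, split on whitespace, and look up each word's ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in review.lower().split()]

print(tokenize("Very good"))                      # [1, 2]
print(tokenize("I enjoyed this as much as ..."))  # unknown words map to 0
```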
Recurrent neural networks
Dealing with sequence inputs
Recurrent neural network
1. Handle words step-by-step.
2. Use the previous state vector and the word vector to create the next state vector
3. Use the final step's output to determine the result
P(h<t+1> | x<t>, h<t>, θ)
(Diagram: context h<0> → f(“Very”) → h<1> → f(“good”) → h<2> → 1: positive)
ŷ = P(sentiment | review)
  = h<2>
  = f(x<2>, h<1>)
  = f(x<2>, f(x<1>, h<0>))
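A minimal sketch of the recurrence above: one state update per word, with the final state turned into P(positive | review). The embedding table, dimensions, and weights are hypothetical and untrained, so the output is arbitrary; it only illustrates the step-by-step flow.

```python
# Minimal RNN sentiment sketch (illustrative, untrained weights):
# one state update per word, final state decides positive/negative.
import numpy as np

D_WORD, D_STATE = 8, 16                       # assumed toy dimensions
rng = np.random.default_rng(0)

vocab = {"very": 0, "good": 1}                # assumed toy vocabulary
E = rng.normal(scale=0.1, size=(len(vocab), D_WORD))      # word vectors
W_h = rng.normal(scale=0.1, size=(D_STATE, D_STATE))      # state -> state
W_x = rng.normal(scale=0.1, size=(D_STATE, D_WORD))       # word  -> state
w_out = rng.normal(scale=0.1, size=D_STATE)               # state -> score

def step(h, x):
    """h<t+1> = f(x<t>, h<t>): combine the previous state and the word vector."""
    return np.tanh(W_h @ h + W_x @ x)

def classify(words):
    h = np.zeros(D_STATE)                     # h<0>
    for w in words:                           # step 1: word by word
        h = step(h, E[vocab[w]])              # step 2: update the state
    return 1 / (1 + np.exp(-w_out @ h))       # step 3: P(positive | review)

print(classify(["very", "good"]))             # some value in (0, 1); weights are untrained
```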
Recurrent neural networks
Use case
We can read ∞-many possible sentences!
But how can we produce sequence outputs?
Seq2seq
Producing sequence outputs
Same principle: produce one word at a time
Words are read by the encoder RNN and its state is updated
The final state of the encoder is fed as the initial state of the decoder
The decoder RNN does its thing: it emits one output word at a time depending on the state, and the state is also updated for the next step
Terminates when a special token is emitted
We can generate ∞-many possible sentences!
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
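A sketch of the encoder-decoder loop described above, with hypothetical toy vocabularies and untrained weights: the encoder reads the source words and updates its state, the final state seeds the decoder, and the decoder emits one word per step until a special end token appears.

```python
# Minimal seq2seq decoding sketch (illustrative; random untrained weights).
import numpy as np

D = 16
rng = np.random.default_rng(0)
src_vocab = {"very": 0, "good": 1}                 # assumed toy vocabularies
tgt_vocab = ["<end>", "muy", "bueno"]

E_src = rng.normal(scale=0.1, size=(len(src_vocab), D))
E_tgt = rng.normal(scale=0.1, size=(len(tgt_vocab), D))
W_enc = rng.normal(scale=0.1, size=(D, 2 * D))     # encoder RNN parameters
W_dec = rng.normal(scale=0.1, size=(D, 2 * D))     # decoder RNN parameters
W_out = rng.normal(scale=0.1, size=(len(tgt_vocab), D))

def rnn_step(W, state, x):
    """One RNN step: combine the previous state and the current input vector."""
    return np.tanh(W @ np.concatenate([state, x]))

def translate(words, max_len=10):
    state = np.zeros(D)
    for w in words:                                # encoder reads words, state is updated
        state = rnn_step(W_enc, state, E_src[src_vocab[w]])
    output, prev = [], E_tgt[tgt_vocab.index("<end>")]   # decoder starts from the final encoder state
    for _ in range(max_len):
        state = rnn_step(W_dec, state, prev)       # decoder state update
        w_id = int(np.argmax(W_out @ state))       # emit the most probable word
        if tgt_vocab[w_id] == "<end>":             # stop on the special token
            break
        output.append(tgt_vocab[w_id])
        prev = E_tgt[w_id]
    return output

print(translate(["very", "good"]))                 # untrained, so the output is arbitrary
```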
Attention
Handling long dependencies
Words can be too far away!
P(h'<t> | y<t−1>, h'<t−1>, θ)
h'<t> has to account for every x<i> ∈ X and every y<i> with i < t
- Questions and answers can easily have 20 words
- Relying on a single context vector to account for the context of 30+ words is not optimal
Image source: https://towardsdatascience.com/sequence-to-sequence-tutorial-4fde3ee798d8
Attention
Handling long dependencies
Tailor-made context vector for each step!
👈 no details today
Image source: https://www.tensorflow.org/text/tutorials/nmt_with_attention
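A sketch of the per-step context vector, using simple dot-product scoring (one of several possible scoring functions; the names and toy sizes here are assumptions): each decoder step weighs all encoder states and takes their weighted sum as its own context.

```python
# Minimal attention sketch (dot-product scoring; illustrative only).
# Instead of one fixed context vector, each decoder step builds its own
# context as a weighted sum of all encoder states.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Tailor-made context for one decoder step.

    decoder_state:  (D,)   current decoder state h'<t>
    encoder_states: (T, D) one state per source word
    """
    scores = encoder_states @ decoder_state        # how relevant is each source word?
    weights = softmax(scores)                      # attention weights, sum to 1
    return weights @ encoder_states, weights       # weighted sum = context vector

# toy usage with random "encoder states" for a 5-word source sentence
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 16))
dec = rng.normal(size=16)
context, w = attention_context(dec, enc)
print(w.round(2), context.shape)                   # weights over 5 words, (16,) context
```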
Transformer
What empowered the seq2seq framework further
The attention-based Transformer model greatly improved NLP performance
👉Parallel encoding is possible (decoding is still auto-regressive)
👉SOTA performance in a multitude of tasks
👉Performance scales indefinitely with the size of data and the number of parameters
State of the art
GPT-3
Language models are flexible task solvers!
Source: https://beta.openai.com/examples
State of the art
Codex (GPT-3 descendant)
Image source: GitHub Copilot
Use cases
https://ai.googleblog.com/2019/05/introducing-translatotron-end-to-end.html
Audio = array of spectrogram patches
No textual representation in between audio in/out
Different modalities
The sequence view from NLP is useful for other modalities too!
Use cases
Different modalities
https://arxiv.org/pdf/2010.11929v2.pdf
Image = array of image patches
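A small sketch of the "image = array of patches" idea: reshape an image into a sequence of fixed-size patches so the same sequence machinery can read it. The 16×16 patch size follows the ViT paper linked above; the array here is an all-zeros placeholder, not a real image.

```python
# Turn an image into a sequence of patch "tokens" (ViT-style patchification).
import numpy as np

img = np.zeros((224, 224, 3))                   # H x W x channels, placeholder data
P = 16                                          # patch size
patches = (img.reshape(224 // P, P, 224 // P, P, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * 3))          # (196, 768): a "sentence" of 196 patches
print(patches.shape)
```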
Use cases
https://arxiv.org/pdf/2102.00719.pdf
Video = array of images
Different modalities
State of the art
DALL-E 2
Image source: https://openai.com/dall-e-2/
NLP model as encoder, generative model as decoder
State of the art
Midjourney examples
@midjourneyartwork
State of the art
Midjourney examples
@midjourney.architecture
Recap
Big ideas in NLP
A. Language is important, but hard to compute
- Context, nuances, ∞-many possible sentences
B. Word2vec creates a means to map words to meaning vectors
- Allows computational representation
C. RNNs can read sentences at the word level
- Compute friendly
D. Seq2seq provides a way to generate sentences
- More flexibility
E. Attention lets you handle long sentences
Simple and effective ideas changed the game
CEO of OpenAI
Implications?
John Carmack on Lex Fridman’s podcast: https://youtu.be/xLi83prR5fg
He also said (something along these lines):
“The remaining ideas are simple enough to be written down on the back of an envelope”
“The code for AGI will be ~10,000 lines of code and will take one man to implement it”
The father of FPS
Where to learn more?
A. Stanford CS224n: Natural Language Processing - lecture
B. The Unreasonable Effectiveness of Recurrent Neural Networks - article
C. The Illustrated Transformer - article
D. Speech and Language Processing - book
E. MIT 6.S191: Introduction to Deep Learning - lecture
Questions?
