Natural Language Processing
Sandeep Malhotra
techBLEND Group Presentation Series, September 20, 2020
Goal
• Design algorithms that allow computers to understand language in order to perform useful tasks.
• Examples
• Spell checking
• Keyword Search
• Information parsing
• Machine Translation
• Semantic Analysis
• Question Answering
• ….
Main Approaches
• Rule-based methods
• Probabilistic Modelling & Machine Learning
• Deep Learning
What is text?
• A sequence of tokens
• A token can be a character, a word, a phrase, etc.
Text Pre-processing
• Text: NLP discussion is interesting
• Tokenize: NLP, discussion, is, interesting
• Normalize: nlp, discuss, interest
• Vectorize: [0.4566, 0.6879], [0.6789, 0.2345], [0.4201, 0.3456]
Hands-on
Text Pre-processing
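A minimal Python sketch of the Text → Tokenize → Normalize → Vectorize pipeline above; it assumes the nltk package (with its punkt and stopwords data) and uses a toy index-based vectorizer, since real vectorization with embeddings comes in the later slides.

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")       # tokenizer model
    nltk.download("stopwords")   # stopword list

    text = "NLP discussion is interesting"

    # Tokenize: split the raw string into word tokens
    tokens = word_tokenize(text)              # ['NLP', 'discussion', 'is', 'interesting']

    # Normalize: lowercase, drop stopwords, stem to root forms
    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))
    normalized = [stemmer.stem(t.lower()) for t in tokens if t.lower() not in stops]
    print(normalized)                         # ['nlp', 'discuss', 'interest']

    # Vectorize: toy example mapping each token to a vocabulary index;
    # real pipelines map tokens to dense vectors (see the embedding slides)
    vocab = {word: i for i, word in enumerate(sorted(set(normalized)))}
    print([vocab[word] for word in normalized])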
Representing Words
• One hot vector
• Embeddings
One hot vectors
• Each word is represented by a vector whose size equals the vocabulary size (V)
• In the vector, the value at the word’s index is 1 and the rest are 0
• High dimension
• Sparse vector
• No way to find similarity between words
• Tea and Coffee – no relationship
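A minimal NumPy sketch of one-hot vectors over a toy four-word vocabulary (the example words are illustrative):

    import numpy as np

    vocab = ["tea", "coffee", "python", "nlp"]      # |V| = 4
    index = {word: i for i, word in enumerate(vocab)}

    def one_hot(word):
        vec = np.zeros(len(vocab))                  # vector of size |V|, all zeros
        vec[index[word]] = 1                        # 1 at the word's index
        return vec

    tea, coffee = one_hot("tea"), one_hot("coffee")
    # The dot product of two different one-hot vectors is always 0,
    # so "tea" and "coffee" show no relationship
    print(tea, coffee, np.dot(tea, coffee))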
Word Embeddings
• Each word is represented by a vector of size d, where d << V
• Each dimension corresponds to a different attribute of the word
• Low dimension
• Dense vector
• Similar words are closer in vector space
• Tea and Coffee will be similar
• Missing (out-of-vocabulary) words are marked as <UNK>
• Alternatively, sub-words can be used
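A toy NumPy sketch with made-up d = 3 embeddings, only to show that similar words end up close (high cosine similarity) in the vector space:

    import numpy as np

    embeddings = {                                   # d = 3, values invented for illustration
        "tea":    np.array([0.81, 0.10, 0.45]),
        "coffee": np.array([0.78, 0.15, 0.50]),
        "python": np.array([0.05, 0.92, 0.11]),
    }

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cosine(embeddings["tea"], embeddings["coffee"]))   # high: tea and coffee are similar
    print(cosine(embeddings["tea"], embeddings["python"]))   # low: unrelated words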
Word Embeddings Types
• Static
• A word will always have the same embedding
• The word ‘python’ in the sentences “He was bitten by a python” and “I know how to code in python” will have the same embedding
• Examples
• Word2Vec, GloVe
• Contextual
• Embedding will depend on the context
• The word ‘python’ in the sentences “He was bitten by a python” and “I know how to code in python” will have different embeddings
• Examples
• BERT
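A sketch of the contextual case, assuming the Hugging Face transformers and torch packages and the bert-base-uncased model (downloaded on first use); the word ‘python’ gets a different vector in each sentence:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def python_vector(sentence):
        # Encode the sentence and return the hidden state of the "python" token
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return hidden[tokens.index("python")]

    v1 = python_vector("He was bitten by a python")
    v2 = python_vector("I know how to code in python")
    # The two contextual vectors differ (cosine similarity well below 1)
    print(torch.cosine_similarity(v1, v2, dim=0))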
Hands-on
Fun with Word Embeddings
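A hands-on sketch with static embeddings, assuming the gensim package and its downloadable glove-wiki-gigaword-100 vectors (a one-time network download):

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")    # pretrained GloVe word vectors

    # Similar words are close in the vector space
    print(vectors.similarity("tea", "coffee"))       # relatively high
    print(vectors.most_similar("coffee", topn=5))

    # The classic analogy: king - man + woman is closest to queen
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))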
Feed Forward Neural Networks
Ignores word order: “Bad not good” is equivalent to “good not bad”
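A small sketch of why, using scikit-learn's CountVectorizer (an assumption for illustration): a bag-of-words representation maps both sentences to the same vector, so a feed-forward model cannot tell them apart.

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["Bad not good", "good not bad"]
    counts = CountVectorizer().fit_transform(sentences).toarray()

    print(counts)                             # both rows are identical word counts
    print((counts[0] == counts[1]).all())     # True: word order is lost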
Recurrent Neural Networks
Remembers context (in practice, only the immediate context), but suffers from the vanishing gradient problem
LSTM – Long Short Term Memory
Ref: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
GRU – Gated Recurrent Unit
Ref: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
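A minimal PyTorch sketch (toy sizes, assumed for illustration) of the three recurrent layers above; LSTM and GRU share the plain RNN's interface but add gates that mitigate the vanishing gradient problem.

    import torch
    import torch.nn as nn

    batch, seq_len, emb_dim, hidden = 2, 5, 8, 16
    x = torch.randn(batch, seq_len, emb_dim)         # a batch of embedded sentences

    rnn = nn.RNN(emb_dim, hidden, batch_first=True)
    lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
    gru = nn.GRU(emb_dim, hidden, batch_first=True)

    out_rnn, h_rnn = rnn(x)                  # h_rnn: final hidden state
    out_lstm, (h_lstm, c_lstm) = lstm(x)     # LSTM also keeps a cell state c
    out_gru, h_gru = gru(x)
    print(out_rnn.shape, out_lstm.shape, out_gru.shape)   # torch.Size([2, 5, 16]) each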
Encoder – Decoder Architecture
Ref: https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
• The input sentence is fed to the encoder, and its vector representation is passed to the first unit of the decoder
• The whole input sentence is represented by a single vector when passed to the decoder, so it can carry only limited information
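A minimal GRU-based sketch (toy sizes, assumed) of this bottleneck: the encoder compresses the whole input sentence into one final hidden state, which is all the decoder receives.

    import torch
    import torch.nn as nn

    emb_dim, hidden = 8, 16
    src = torch.randn(1, 6, emb_dim)     # embedded source sentence (6 words)
    tgt = torch.randn(1, 4, emb_dim)     # embedded target sentence (teacher forcing)

    encoder = nn.GRU(emb_dim, hidden, batch_first=True)
    decoder = nn.GRU(emb_dim, hidden, batch_first=True)

    _, context = encoder(src)            # one vector summarizing the whole input
    outputs, _ = decoder(tgt, context)   # decoder starts from that single vector
    print(context.shape, outputs.shape)  # torch.Size([1, 1, 16]), torch.Size([1, 4, 16])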
Attention Mechanism
Ref: https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129
• Based on how much attention each word in the decoder (output) should pay to each word in the encoder (input)
• Decoder has access to all hidden states of
encoder
• Multiple attention scoring functions are possible, e.g.
• cosine similarity
• Bahdanau Attention (Additive)
• Luong Attention (Multiplicative)
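A sketch of the simplest (dot-product) scoring with toy tensors; Bahdanau and Luong attention differ only in how the scores are computed.

    import torch
    import torch.nn.functional as F

    encoder_states = torch.randn(6, 16)      # one hidden state per input word
    decoder_state = torch.randn(16)          # current decoder (output) state

    scores = encoder_states @ decoder_state  # how relevant each input word is
    weights = F.softmax(scores, dim=0)       # attention weights, sum to 1
    context = weights @ encoder_states       # weighted sum of all encoder states
    print(weights, context.shape)            # 6 weights, a 16-dim context vector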
Transformer Architecture
Ref: https://arxiv.org/abs/1706.03762
• Uses self-attention, i.e., how much attention each word pays to the other words in the same sentence
• The encoder processes the words in parallel rather than sequentially (as recurrent networks do)
• Positional encoding is used to capture the relative positions of words in the sentence
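A sketch of scaled dot-product self-attention for one toy sentence (random tensors, single head, positional encoding omitted): every word attends to every word in the same sentence, and all positions are processed in parallel.

    import math
    import torch
    import torch.nn.functional as F

    seq_len, d_model = 5, 16
    x = torch.randn(seq_len, d_model)              # embedded (+ position-encoded) words

    Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv               # queries, keys, values

    scores = Q @ K.T / math.sqrt(d_model)          # each word's attention to every word
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    out = weights @ V                              # new representation for every word
    print(weights.shape, out.shape)                # torch.Size([5, 5]), torch.Size([5, 16])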
Thank You
