This presentation gives an overview of recent trends in representation learning in NLP, and takes a closer look at the BERT architecture and the Transformer it is built on.
Outline
● Finite State Automata
● Bag of Words - Naive Bayes approach
● Word2Vec - CBOW and SkipGram
● Seq2seq models
● Attention and Transformer
● ELMo and GPT
● BERT
Why Representation Learning?
● Unlike images in Computer Vision, whose pixel values are already numbers, machines cannot understand words as they are.
● Some form of numeric representation is necessary for the machine to understand text.
● Hence, word embeddings.
Bag of words and Naive Bayes
Working
● Vocabulary of known words
● Frequency of occurrence of words
Limitations
Naive assumption:
● The occurrence of one word is independent of the occurrences of all other words.
● Information on the order of words is lost.
● Out-of-vocabulary (OOV) words cannot be modelled.
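As a concrete illustration, a minimal bag-of-words + Naive Bayes pipeline can be put together with scikit-learn; the toy sentences and labels below are hypothetical:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great product, loved it", "terrible service, never again"]
labels = [1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()        # builds the vocabulary of known words
X = vectorizer.fit_transform(texts)   # rows = documents, columns = word counts

clf = MultinomialNB().fit(X, labels)  # treats each word's occurrence as independent
print(clf.predict(vectorizer.transform(["loved the service"])))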
Neural Models - Word2Vec
(Mikolov et al., 2013)
King - Man + Woman = Queen
● First revolution in NLP, as neural models were used for the first time.
● CBOW - predict a word based on nearby context words.
● SkipGram - predict context words given the target word.
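A minimal sketch of training both variants with gensim (the toy corpus and hyperparameters are illustrative, not those from the paper):

from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"]]

# sg=0 -> CBOW (predict word from context); sg=1 -> SkipGram (predict context from word)
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

vec = model.wv["king"]                        # the learned 100-dimensional embedding
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the vector space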
Limitations of Word2Vec
● There is no representation for out-of-vocabulary (OOV) words.
● Some opposite word pairs are hard to separate: for example, “good” and “bad” are usually located very close to each other in the vector space, which may limit the performance of word vectors in NLP tasks like sentiment analysis.
● Embeddings are not context-based: for example, the word ‘crane’ can be used in different contexts, but word2vec gives it the same representation in all of them, leading to loss of information.
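These limitations are easy to observe with pre-trained vectors, e.g. via gensim's downloader (the model name is one of gensim's bundled options; exact values will vary):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")
print(wv.similarity("good", "bad"))  # typically high, despite opposite sentiment
print(wv["crane"][:5])               # a single vector, whichever sense is meant
# wv["Hinglish"] would raise a KeyError: no representation for OOV words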
Seq2seq models
● Use of GRUs and LSTMs.
● Second revolution in NLP.
● Tasks such as machine translation, question answering, and sentence classification have been achieved using these models.
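A minimal sketch of an LSTM-based encoder-decoder in PyTorch (vocabulary sizes and dimensions are arbitrary placeholders):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))       # compress the source into a final hidden state
        dec, _ = self.decoder(self.tgt_emb(tgt), state)  # decode conditioned on that state
        return self.out(dec)                             # logits over the target vocabulary

model = Seq2Seq()
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])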
ELMo, GPT - New Age in NLP
● Feature-based and fine-tuning strategies.
● ELMo (Peters et al.) - feature-based; GPT (Radford et al.) - fine-tuning.
● They use unidirectional language models to learn general language representations.
● ELMo uses a bidirectional LSTM trained on a next-word prediction task.
● In OpenAI GPT, the authors use a left-to-right Transformer decoder architecture.
Contextualized word embeddings
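A minimal sketch of the feature-based style using Hugging Face Transformers, with GPT-2 standing in as the pre-trained language model (the model choice is illustrative):

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The crane lifted the beam", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # one context-dependent vector per token
print(hidden.shape)  # (1, num_tokens, 768)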
Bidirectional Encoder Representations from Transformers - BERT
● Devlin et al., 2018, Google Research.
● The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.
● There are two steps in the framework: pre-training and fine-tuning.
● Pre-training is first done on unlabeled data over different tasks.
● For fine-tuning, the model is initialized with the pre-trained parameters and then fine-tuned on different downstream tasks.
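A minimal sketch of the two-step recipe with Hugging Face Transformers: load the pre-trained weights, then fine-tune on a downstream classification task (the data here is hypothetical):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "awful movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # pre-trained encoder + freshly initialized classifier head
outputs.loss.backward()                  # an optimizer step on all parameters would follow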
Model
Two models:
● BERT-Base - 12 encoder blocks, 110M parameters.
● BERT-Large - 24 encoder blocks, 340M parameters.
● The BERT encoder is a semi-supervised model trained on two tasks:
● Masked Language Model: 15% of the tokens in a sentence are masked with [MASK] and the model learns to predict the masked tokens.
● Next Sentence Prediction: the model is trained to classify whether a particular sentence follows the given sentence or not.
● For the pre-training corpus, the authors used the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).
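A minimal sketch of the 15% masking used for the MLM objective; note the paper actually replaces a chosen token with [MASK] only 80% of the time (10% random token, 10% unchanged), which is simplified away here:

import random

def mask_tokens(tokens, prob=0.15):
    masked, targets = [], []
    for tok in tokens:
        if random.random() < prob:
            masked.append("[MASK]")
            targets.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # no loss is computed at this position
    return masked, targets

print(mask_tokens("the cat sat on the mat".split()))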
Attention Is All You Need (Vaswani et al.)
Working of a Transformer:
● Uses attention instead of recurrent units like LSTMs.
● Three trainable matrices are introduced: Queries, Keys, and Values.
● Information regarding the order of words is lost, hence position embeddings are used.
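A minimal sketch of scaled dot-product attention with NumPy (one head, no batching; shapes are illustrative):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project inputs with the three trainable matrices
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each query attends to every key
    return softmax(scores) @ V               # attention-weighted sum of the values

X = np.random.randn(5, 16)                   # 5 tokens, 16-dim embeddings (position embeddings added in practice)
Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)        # (5, 16)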
Our Task
● Customer care call transcripts from a large-scale insurance aggregator.
● Hinglish (English + Hindi written in the English script).
● Task: binary classification of whether a given person will buy the service or not.
● Pre-trained the model on a 171M-word corpus.
● Achieved a pre-training accuracy of 75% on the MLM task.
XLNet - bigger is better?
● Focus on the number of parameters.
● How they used autoencoders (read about them).
● Latest results.
Editor's Notes
Converting all alphabet characters to lowercase, e.g. replacing “Word” with “word”.
Using a predefined contractions dictionary map to expand contractions, e.g. replacing “shouldn’t” with “should not”.
Replacing digits with a fixed token, e.g. converting “$ 350” to “$ ###”.
We use a combination of three models, using GloVe, Paragram and FastText to generate word embeddings.
We search for the original version, lowercase version, uppercase version, capitalized version, stemmed version, lemmatized version and the corrected version in order to get the embedding vectors from these pre-trained embeddings.
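A minimal sketch of these preprocessing steps and the lookup fallback chain; `embeddings` stands in for any loaded GloVe/Paragram/FastText table (a plain dict here), and the contractions map is truncated for illustration:

import re

CONTRACTIONS = {"shouldn't": "should not", "can't": "can not"}

def preprocess(text):
    text = text.lower()                 # lowercase all alphabet characters
    for k, v in CONTRACTIONS.items():
        text = text.replace(k, v)       # expand contractions via the dictionary map
    return re.sub(r"\d+", "###", text)  # replace digits with a fixed token

def lookup(word, embeddings):
    # Try progressively normalized forms until one is in the vocabulary;
    # stemmed, lemmatized and spell-corrected fallbacks would follow.
    for form in (word, word.lower(), word.upper(), word.capitalize()):
        if form in embeddings:
            return embeddings[form]
    return None

print(preprocess("Shouldn't cost $ 350"))  # -> "should not cost $ ###"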