This presentation gives an overview of recent trends in representation learning in NLP, and takes a closer look at the BERT architecture and the Transformer it is built on.
Outline
● Finite State Automata
● Bag of Words - Naive Bayes approach
● Word2Vec - CBOW and SkipGram
● Seq2seq models
● Attention and Transformer
● ELMo and GPT
● BERT
Why Representation Learning?
● Unlike images in Computer Vision, whose pixel values are already numbers, machines cannot understand words as they are.
● Some form of numeric representation is necessary for the machine to understand text.
● Hence, word embeddings.
Bag of words and Naive Bayes
Working
● Vocabulary of known words
● Frequency of occurrence of words
Limitations
Naive assumption:
● The occurrence of one word is independent of the occurrences of all other words.
● Information on the order of words is lost.
● Out-of-vocabulary (OOV) words cannot be modelled.
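As a concrete illustration, a minimal bag-of-words + Naive Bayes pipeline can be put together with scikit-learn; the toy sentences and labels below are hypothetical:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great product, loved it", "terrible service, never again"]
labels = [1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()        # builds the vocabulary of known words
X = vectorizer.fit_transform(texts)   # rows = documents, columns = word counts

clf = MultinomialNB().fit(X, labels)  # treats each word's occurrence as independent
print(clf.predict(vectorizer.transform(["loved the service"])))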
Neural Models - Word2Vec
(Mikolov et al., 2013)
King - Man + Woman = Queen
● First revolution in NLP, as neural models were used for the first time.
● CBOW - predict a word based on nearby context words.
● SkipGram - predict context words given the target word.
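A minimal sketch of training both variants with gensim (the toy corpus and hyperparameters are illustrative, not those from the paper):

from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"]]

# sg=0 -> CBOW (predict word from context); sg=1 -> SkipGram (predict context from word)
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

vec = model.wv["king"]                        # the learned 100-dimensional embedding
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the vector space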
Limitations of Word2Vec
● There is no representation for out-of-vocabulary (OOV) words.
● Some opposite word pairs are hard to separate: for example, “good” and “bad” are usually located very close to each other in the vector space, which may limit the performance of word vectors in NLP tasks like sentiment analysis.
● Embeddings are not context-based: for example, the word ‘crane’ can be used in different contexts, but word2vec gives it the same representation in all of them, leading to loss of information.
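These limitations are easy to observe with pre-trained vectors, e.g. via gensim's downloader (the model name is one of gensim's bundled options; exact values will vary):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")
print(wv.similarity("good", "bad"))  # typically high, despite opposite sentiment
print(wv["crane"][:5])               # a single vector, whichever sense is meant
# wv["Hinglish"] would raise a KeyError: no representation for OOV words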
Seq2seq models
● Use of GRUs and LSTMs.
● Second revolution in NLP.
● Tasks such as machine translation, question answering, and sentence classification have been achieved using these models.
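A minimal sketch of an LSTM-based encoder-decoder in PyTorch (vocabulary sizes and dimensions are arbitrary placeholders):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))       # compress the source into a final hidden state
        dec, _ = self.decoder(self.tgt_emb(tgt), state)  # decode conditioned on that state
        return self.out(dec)                             # logits over the target vocabulary

model = Seq2Seq()
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])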
ELMo, GPT - New Age in NLP
● Feature-based and fine-tuning strategies.
● ELMo (Peters et al.) - feature-based; GPT (Radford et al.) - fine-tuning.
● They use unidirectional language models to learn general language representations.
● ELMo uses a bidirectional LSTM trained on a next-word prediction task.
● In OpenAI GPT, the authors use a left-to-right Transformer decoder architecture.
Contextualized word embeddings
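A minimal sketch of the feature-based style using Hugging Face Transformers, with GPT-2 standing in as the pre-trained language model (the model choice is illustrative):

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The crane lifted the beam", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # one context-dependent vector per token
print(hidden.shape)  # (1, num_tokens, 768)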
Bidirectional Encoder Representations from Transformers - BERT
● Devlin et al., 2018, Google Research.
● The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.
● There are two steps in the framework: pre-training and fine-tuning.
● Pre-training is first done on unlabeled data over different tasks.
● For fine-tuning, the model is initialized with the pre-trained parameters and then fine-tuned on different downstream tasks.
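A minimal sketch of the two-step recipe with Hugging Face Transformers: load the pre-trained weights, then fine-tune on a downstream classification task (the data here is hypothetical):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "awful movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # pre-trained encoder + freshly initialized classifier head
outputs.loss.backward()                  # an optimizer step on all parameters would follow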
Model
Two models:
● BERT-Base - 12 encoder blocks, 110M parameters.
● BERT-Large - 24 encoder blocks, 340M parameters.
● The BERT encoder is a semi-supervised model trained on two tasks:
● Masked Language Model: 15% of the tokens in a sentence are masked with [MASK] and the model learns to predict the masked tokens.
● Next Sentence Prediction: the model is trained to classify whether a particular sentence follows the given sentence or not.
● For the pre-training corpus, the authors used the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).
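A minimal sketch of the 15% masking used for the MLM objective; note the paper actually replaces a chosen token with [MASK] only 80% of the time (10% random token, 10% unchanged), which is simplified away here:

import random

def mask_tokens(tokens, prob=0.15):
    masked, targets = [], []
    for tok in tokens:
        if random.random() < prob:
            masked.append("[MASK]")
            targets.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # no loss is computed at this position
    return masked, targets

print(mask_tokens("the cat sat on the mat".split()))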
Attention Is All You Need (Vaswani et al.)
Working of a Transformer:
● Uses attention instead of recurrent units like LSTMs.
● Three trainable matrices are introduced: Queries, Keys, and Values.
● Information regarding the order of words is lost, hence position embeddings are used.
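A minimal sketch of scaled dot-product attention with NumPy (one head, no batching; shapes are illustrative):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # project inputs with the three trainable matrices
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each query attends to every key
    return softmax(scores) @ V               # attention-weighted sum of the values

X = np.random.randn(5, 16)                   # 5 tokens, 16-dim embeddings (position embeddings added in practice)
Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)        # (5, 16)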
Our Task
● Customer care call transcripts from a large-scale insurance aggregator.
● Hinglish (English + Hindi written in the English script).
● Task: binary classification of whether a given person will buy the service or not.
● Pre-trained the model on a 171M-word corpus.
● Achieved a pre-training accuracy of 75% on the MLM task.
XLNet - bigger is better?
● Focus on the number of parameters.
● How they used autoencoders (read about them).
● Latest results.
Editor's Notes
Converting all alphabet characters to lowercase, e.g. replacing “Word” with “word”.
Using a predefined contractions dictionary map to expand contractions, e.g. replacing “shouldn’t” with “should not”.
Replacing digits with a fixed token, e.g. converting “$ 350” to “$ ###”.
We use a combination of three models, using GloVe, Paragram and FastText to generate word embeddings.
We search for the original version, lowercase version, uppercase version, capitalized version, stemmed version, lemmatized version and the corrected version in order to get the embedding vectors from these pre-trained embeddings.
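A minimal sketch of these preprocessing steps and the lookup fallback chain; `embeddings` stands in for any loaded GloVe/Paragram/FastText table (a plain dict here), and the contractions map is truncated for illustration:

import re

CONTRACTIONS = {"shouldn't": "should not", "can't": "can not"}

def preprocess(text):
    text = text.lower()                 # lowercase all alphabet characters
    for k, v in CONTRACTIONS.items():
        text = text.replace(k, v)       # expand contractions via the dictionary map
    return re.sub(r"\d+", "###", text)  # replace digits with a fixed token

def lookup(word, embeddings):
    # Try progressively normalized forms until one is in the vocabulary;
    # stemmed, lemmatized and spell-corrected fallbacks would follow.
    for form in (word, word.lower(), word.upper(), word.capitalize()):
        if form in embeddings:
            return embeddings[form]
    return None

print(preprocess("Shouldn't cost $ 350"))  # -> "should not cost $ ###"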