Conference presentation for our paper published in the International Advanced Computing Conference (Springer, Singapore). It explores contextual word embedding models (BERT, RoBERTa, XLM-RoBERTa, CamemBERT, DistilBERT) for Hindi Named Entity Recognition and compares them against FastText and Word2Vec.
Contextual vs Non-Contextual Word Embedding Models for Hindi Named Entity Recognition | Natural Language Processing | Aindriya Barua
Slide 1

129: Analysis of Contextual and Non-Contextual Word Embedding Models for Hindi NER with Web Application
Paper Presentation, IACC - 2020
6th December, 2020
Center for Computational Engineering and Networking (CEN)

Aindriya Barua (barua.aindriya@gmail.com)
Co-authors: Dr. K. P. Soman (HoD, CEN), Thara S, Premjith B
Slide 2: RESOURCES

● The PDF of the full paper is publicly available here.
● All the code used in the research is available on my GitHub.
● The full Hindi NER dataset is available here.

If you find any part of the resources provided here useful for your research, please cite this paper:

Barua, A., Thara, S., Premjith, B. and Soman, K.P., 2020, December. Analysis of Contextual and Non-contextual Word Embedding Models for Hindi NER with Web Application for Data Collection. In International Advanced Computing Conference (pp. 183-202). Springer, Singapore.
Slide 3: ABSTRACT

1. We categorize word embeddings as contextual and non-contextual, and compare them both within and across categories for NER in Hindi (Devanagari script).
2. Under the non-contextual category, we experiment with Word2Vec and FastText.
3. Under the contextual category, we experiment with BERT and its variants: RoBERTa, ELECTRA, CamemBERT, DistilBERT, and XLM-RoBERTa.
4. The best model is used to build an interactive web app for Hindi NER.
Slide 4: Dataset

● The dataset is taken from the first shared task on Information Extraction for Conversational Systems in Indian Languages (IECSIL).
● It consists of Hindi words and their corresponding NER labels.
● 1,548,570 words and labels in total.
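The deck does not show how the corpus file is structured; below is a minimal loading sketch assuming a CoNLL-style two-column, tab-separated layout (word, label) with blank lines separating sentences, which is a common format for such shared tasks. The filename is hypothetical.

```python
# Minimal sketch: loading a word/label NER corpus.
# Assumption: word<TAB>label per line, blank lines between sentences.

def load_conll(path):
    """Return a list of sentences, each a list of (word, label) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line = sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, label = line.split("\t")
            current.append((word, label))
    if current:
        sentences.append(current)
    return sentences

sentences = load_conll("iecsil_hindi_ner.tsv")  # hypothetical filename
print(len(sentences), "sentences")
```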
Slide 6: Word Embedding Classification

● Non-contextual: produces a single vector per word, regardless of the word's position in a sentence, and so disregards the different meanings a word may have.
● Contextual: produces distinct embeddings for a word depending on its position in a sentence, so the representations are context-dependent. We analyze Transformer-based word embedding models, which are contextual.
Slide 7: Word2Vec and FastText (Non-Contextual)

● Word2Vec trains words against the other words that neighbor them in the input corpus.
● It does so in one of two ways: either using the context to predict a target word (continuous bag of words, CBOW), or using a word to predict a target context (skip-gram).
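To make the CBOW/skip-gram distinction concrete, here is a minimal sketch of training Word2Vec with gensim; the toy corpus and hyperparameters are illustrative, not the settings used in the paper.

```python
# Minimal Word2Vec training sketch with gensim (illustrative settings).
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
corpus = [["भारत", "एक", "देश", "है"],
          ["दिल्ली", "भारत", "की", "राजधानी", "है"]]

# sg=0 selects CBOW (context predicts target word); sg=1 selects skip-gram.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["भारत"]  # 100-dim vector; the same vector in every context
print(vec.shape)
```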
Slide 8: Word2Vec vs FastText

● FastText treats each word as the aggregation of its subwords: the character n-grams of the word. The vector for a word is the sum of the vectors of its n-grams.
● FastText does significantly better on morphologically rich languages.
● FastText can produce vectors for out-of-vocabulary (OOV) words by summing the vectors of their component character n-grams, provided at least one of the n-grams was present in the training data.
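A minimal sketch of FastText's subword and OOV behaviour with gensim; the toy corpus, n-gram range, and vector size are illustrative assumptions.

```python
# Minimal FastText OOV sketch with gensim (illustrative settings).
from gensim.models import FastText

corpus = [["भारत", "एक", "देश", "है"],
          ["दिल्ली", "भारत", "की", "राजधानी", "है"]]

# min_n/max_n control the character n-gram sizes used as subwords.
model = FastText(corpus, vector_size=100, window=5, min_count=1,
                 min_n=2, max_n=5)

# "भारतीय" never appears in the corpus, but shares character n-grams with
# "भारत", so FastText can still compose a vector for it.
print("भारतीय" in model.wv.key_to_index)   # False: the word is OOV
print(model.wv["भारतीय"].shape)            # still yields a (100,) vector
```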
Slide 9: Sequence-to-Sequence Models

● Sequence-to-sequence (seq2seq) models convert sequences of type A to sequences of type B, e.g. translating English sentences to German sentences.
● Classic seq2seq models are built on RNNs.
Slide 10: RNN-Based Sequence-to-Sequence Model

● At each step, the RNN encoder takes a word vector (x_i) from the input sequence and the hidden state (H_i) from the previous time step.
● The hidden state from the last unit is known as the context vector.
● The context vector is passed to the decoder, which uses it to generate the target sequence (e.g. the English phrase).
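To show where the context vector comes from, here is a minimal GRU encoder in PyTorch; all sizes are illustrative, and this is not the paper's architecture.

```python
# Minimal RNN (GRU) encoder sketch producing a context vector.
import torch
import torch.nn as nn

emb_dim, hidden_dim, vocab_size = 64, 128, 1000

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 7))   # one sentence of 7 token ids
x = embedding(tokens)                            # (1, 7, emb_dim)

outputs, h_n = encoder(x)
# h_n is the hidden state after the last time step: the "context vector"
# that a seq2seq decoder would be initialised with.
print(h_n.shape)  # (1, 1, hidden_dim)
```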
Slide 11: Downfalls of Seq2seq

● Dealing with long-range dependencies is still challenging.
● The sequential nature of the architecture prevents parallelization.
● These challenges are addressed by Google Brain's Transformer.
Slide 12: Transformers to Understand BERT

● Capturing these relationships and the order of words in a sentence is where the Transformer plays a major role.
Slide 14: BERT

BERT = Bidirectional Encoder Representations from Transformers.
● Bidirectional: it reads text in both directions, left and right, at the same time, giving a better understanding of context.
● Encoder: it uses the Transformer's encoder.
● Transformers: BERT is a multi-layer bidirectional Transformer encoder; its self-attention layers attend in both directions.
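As a concrete illustration, here is a minimal sketch of extracting contextual embeddings with the Hugging Face transformers library; the multilingual checkpoint named below is an assumption, since the slide does not state which pretrained weights were used.

```python
# Minimal contextual-embedding sketch with Hugging Face Transformers.
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; the deck does not name the exact pretrained model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentence = "दिल्ली भारत की राजधानी है"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Unlike Word2Vec/FastText, each token's vector here depends on the whole
# sentence, so the same word gets different vectors in different contexts.
print(outputs.last_hidden_state.shape)  # (1, num_subword_tokens, 768)
```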
Slide 15: BERT Variants

● RoBERTa: a modification of BERT that tunes the key hyper-parameters, training with much larger learning rates and mini-batches.
● XLM-RoBERTa: a large multilingual model pre-trained on a huge amount of data; it does not require special tensors to indicate the language.
● CamemBERT: based on Facebook's RoBERTa, but trained on French data.
● DistilBERT: a comparatively small, fast, inexpensive, light-weight Transformer model trained by distilling BERT.
● ELECTRA: a pre-training method that trains two Transformers, a generator and a discriminator. The generator replaces tokens in a sequence and is trained as a masked language model; the discriminator identifies which tokens in the sequence were replaced by the generator.
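All of these variants can be swapped behind a common interface, which is how such a comparison is typically run. A minimal sketch follows; the checkpoint ids are the standard Hugging Face hub names and are assumptions, and num_ner_labels is illustrative.

```python
# Minimal sketch: loading several BERT variants for token classification.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoints = {
    "bert":        "bert-base-multilingual-cased",
    "xlm-roberta": "xlm-roberta-base",
    "camembert":   "camembert-base",
    "distilbert":  "distilbert-base-multilingual-cased",
}

num_ner_labels = 9  # illustrative; depends on the dataset's tag set

for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    # The token-classification head is freshly initialised here and would
    # need fine-tuning on the NER data before it predicts anything useful.
    model = AutoModelForTokenClassification.from_pretrained(
        ckpt, num_labels=num_ner_labels)
    print(name, model.config.model_type)
```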
Slide 16: DATA FLOW DIAGRAM FOR NON-CONTEXTUAL MODELS
Slide 17: DATA FLOW DIAGRAM FOR CONTEXTUAL MODELS
Slide 19: INTRA-CATEGORY COMPARISON: NON-CONTEXTUAL
Slide 20: INTRA-CATEGORY COMPARISON: CONTEXTUAL
Slide 21: INTRA-CATEGORY COMPARISON: NON-CONTEXTUAL

Hindi is morphologically rich, and words are usually formed by combining sub-words ('sandhi'), which is intuitively similar to FastText's practice of breaking words into sub-words.

For OOV words, FastText sums the vectors of the component character n-grams; if at least one of the n-grams is present in the training data, it can estimate a representation for the new word from it.
Slide 22: INTRA-CATEGORY COMPARISON: CONTEXTUAL

● XLM-RoBERTa is a multilingual model trained on 100 languages over a significantly larger dataset. Its training corpus, CommonCrawl, is about 2.5 TB, many times larger than the Wiki-100 corpus used to train its predecessors.
● BERT performs better than RoBERTa on Hindi NER by approximately 7%.
● CamemBERT is trained on French monolingual data, so it is interesting to note its performance on Hindi: it shows a 17% degradation from BERT's F1 score.
● DistilBERT's execution time is approximately four times less than BERT's, but this comes with a trade-off in prediction metrics: it shows a massive 38% degradation from BERT in our training.
● Although ELECTRA is claimed to improve on BERT, it causes a degradation of 45% on our Hindi NER task.
Slide 23: INTER-CATEGORY COMPARISON: CONTEXTUAL VS NON-CONTEXTUAL
Slide 25: WEB APP FOR INTERACTIVE HINDI NER AND DATA COLLECTION

After the successful completion of all the experiments, the best model was used to build a first-of-its-kind interactive web application for NER in Hindi (Devanagari script), deployed at http://3.7.28.233.
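The deck does not show the application's code; the following is a minimal Flask sketch of how such an NER endpoint could look, with predict_entities as a hypothetical stand-in for the trained model's inference function.

```python
# Minimal Flask sketch of an interactive NER endpoint (hypothetical).
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_entities(text):
    """Placeholder: run the trained NER model over `text` and return
    a list of (token, label) pairs."""
    return [(tok, "O") for tok in text.split()]

@app.route("/ner", methods=["POST"])
def ner():
    text = request.json.get("text", "")
    return jsonify({"entities": predict_entities(text)})

if __name__ == "__main__":
    app.run()
```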
Slide 26: Conclusion

Using these techniques, more applications can be built in the mother tongues of India, bridging the language gap between Indians and the world, and also extracting valuable information from our ancient history.

Future work:
1. Hyper-parameter tuning: tweaking the learning rates, batch sizes, etc.
2. The dataset has class imbalance, as established during the experiments; hence a cost-sensitive learning approach could also yield better outcomes (a minimal sketch follows this list).
3. Reinforcement learning could be incorporated into the web application, improving the models with the user feedback the website is designed to collect.
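As a minimal sketch of the cost-sensitive idea in point 2: class-weighted cross-entropy in PyTorch, with hypothetical label counts used to derive inverse-frequency weights.

```python
# Minimal cost-sensitive learning sketch: class-weighted cross-entropy.
import torch
import torch.nn as nn

# Hypothetical label counts, e.g. for tags O, PER, LOC, ORG.
label_counts = torch.tensor([900_000, 50_000, 30_000, 20_000],
                            dtype=torch.float)
weights = label_counts.sum() / (len(label_counts) * label_counts)

# Rare classes now contribute more to the loss, countering the imbalance.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 4)             # (batch, num_labels)
targets = torch.randint(0, 4, (8,))
print(criterion(logits, targets))
```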
Slide 27

github.com/AindriyaBa
barua.aindriya@gmail.com

Questions? Feel free to reach out to me via e-mail or GitHub :)

If you find any part of the resources provided here useful for your research, please cite this paper:
https://www.researchgate.net/publication/349190662_Analysis_of_Contextual_and_Non-contextual_Word_Embedding_Models_for_Hindi_NER_with_Web_Application_for_Data_Collection