Vectorization In NLP.pptx

Vectorization In NLP
By Chode Amarnath

Important links
→ https://www.turing.com/kb/guide-on-word-embeddings-in-nlp

USES
→ Bag of words: Extracts features from the text
→ TF-IDF: Information retrieval, keyword extraction
→ Word2Vec: Semantic analysis tasks
→ GloVe: Word analogy, named entity recognition tasks
→ BERT: language translation, question answering system

Index
→ Vectorization Techniques
→ Word Embedding Pre-trained Methods

Vectorization
Vectorization is the process of converting words into numbers.
→ It is easy for us to understand a sentence because we know the
semantics of its words. But how can a program (e.g. one written in Python)
interpret the sentence?
→ Programs work with textual data far more easily when it is expressed
as numerical values. For this reason, we vectorize the text so that it has a
representation a program can process.

To convert string data into numerical data, one can use the following methods:
1. Bag of words
2. TF-IDF
3. Word2Vec

What are N-grams and Why do we use them?

An N-gram is a sequence of N consecutive tokens (words): a 2-gram (more commonly called
a bigram) is a two-word sequence such as “really good”, “not good”, or “your homework”,
and a 3-gram (more commonly called a trigram) is a three-word sequence such as “not
at all” or “turn off light”.

Because bigrams keep local word order (for example, “not good” stays distinct from
“good”), a “bag of bigrams” is more expressive than a “bag of words”, and in many
cases it is a baseline that is very hard to beat.
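
The definition above can be sketched in a few lines of plain Python. The function name ngrams and the example sentence are illustrative assumptions, and the tokenizer is a simple whitespace split rather than a real NLP tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, split on whitespace.
tokens = "this movie is not good".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # the bigram "not good" keeps the negation together
print(trigrams)
```

With unigrams alone, "not" and "good" become two unrelated features; the bigram list keeps "not good" as a single feature.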

Bag of Words
It is a basic model used in natural language processing.
→ It is called a bag of words because the order of the words in the
document is discarded; the model only records which words occur in the document
(and how often), not where they occur.
Eg:
“There used to be Stone Age”
“There used to be Bronze Age”
“There used to be Iron Age”
“There was Age of Revolution”
“Now it is Digital Age”

Here each sentence is a separate document. If we make a list of the words such that
each word occurs only once, our list is:
“There”, “was”, “to”, “be”, “used”, “Stone”, “Bronze”, “Iron”, “Revolution”, “Digital”,
“Age”, “of”, “Now”, “it”, “is”
We then count the occurrences of each listed word in a document. For example, the
sentence “There used to be Stone Age” can be represented as:
“There used to be Stone Age” = [1,0,1,1,1,1,0,0,0,0,1,0,0,0,0]

So here we basically convert each sentence into a vector.
Following the same approach, the other vectors are as follows:
“There used to be Bronze Age” = [1,0,1,1,1,0,1,0,0,0,1,0,0,0,0]
“There used to be Iron Age” = [1,0,1,1,1,0,0,1,0,0,1,0,0,0,0]
“There was Age of Revolution” = [1,1,0,0,0,0,0,0,1,0,1,1,0,0,0]
“Now it is Digital Age” = [0,0,0,0,0,0,0,0,0,1,1,0,1,1,1]

The approach discussed above is the unigram approach, because we consider only one
word at a time. Similarly, we have bigrams (two words at a time, for example:
There used, used to, to be, be Stone, Stone Age), trigrams (three words at a time, for
example: There used to, used to be, to be Stone, be Stone Age), and n-grams (n words at a
time).
Using scikit-learn's CountVectorizer we can convert text documents into a matrix of word
counts. The matrix it produces is a sparse matrix.
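
The five example vectors can be reproduced by hand. This is a plain-Python sketch that keeps the word list in the order used on this slide; the function name bag_of_words is an illustrative assumption. It is only a stand-in for CountVectorizer, which by default also lowercases the text and orders its vocabulary alphabetically, so its columns would appear in a different order.

```python
# Word list in the same order as on the slide.
vocabulary = ["There", "was", "to", "be", "used", "Stone", "Bronze", "Iron",
              "Revolution", "Digital", "Age", "of", "Now", "it", "is"]

documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]


def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = sentence.split()  # simple whitespace tokenizer, case-sensitive
    return [tokens.count(word) for word in vocabulary]


vectors = [bag_of_words(doc, vocabulary) for doc in documents]

for doc, vec in zip(documents, vectors):
    print(doc, "=", vec)
```

Most entries in every row are zero, which is why a library stores the result as a sparse matrix.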

CountVectorizer
Convert a collection of text documents to a matrix of token counts.
But its major disadvantages are:
→ It cannot distinguish more important words from less important ones.
→ It simply treats the words that are most abundant in a corpus as the most
statistically significant.
→ It does not capture relationships between words, such as linguistic
similarity.

TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how important
a word is to a document within a corpus or dataset.
→ TF-IDF combines two concepts: Term Frequency (TF) and Inverse Document
Frequency (IDF).

Term Frequency
This measures how frequently a word occurs in a document.

Inverse Document Frequency
Inverse document frequency is the second concept used to determine the importance
of a word.
→ It is based on the idea that words which appear in fewer documents are
more informative and important.

TF-IDF
It reduces the weight of common words that appear in many different documents.
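
The slides do not give the formulas, so the sketch below assumes the common textbook form: TF = (count of the term in the document) / (number of terms in the document) and IDF = ln(N / number of documents containing the term). The function names are illustrative. Library implementations often differ; scikit-learn's TfidfVectorizer, for instance, uses a smoothed IDF and normalizes each row, so its numbers will not match these.

```python
import math

documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]
# Lowercase and split on whitespace (a deliberately simple tokenizer).
corpus = [doc.lower().split() for doc in documents]


def term_frequency(term, tokens):
    """Share of the document's tokens that are this term."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """ln(N / df); assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


score_age = tf_idf("age", corpus[0], corpus)      # "age" is in every document
score_stone = tf_idf("stone", corpus[0], corpus)  # "stone" is in one document

print("age:", score_age)
print("stone:", score_stone)
```

Both words occur once in the first document, so a plain count treats them equally. Under this formula "age" scores zero because it appears in all five documents, while "stone" keeps a positive weight.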

Difference Between Count Vectorizer and TF-IDF
→ TF-IDF Vectorizer and CountVectorizer are both methods used in natural
language processing to vectorize text, but there is a fundamental difference
between them.
→ CountVectorizer simply counts the number of times a word appears in a
document (a bag-of-words approach), while TF-IDF Vectorizer takes into account
not only how many times a word appears in a document but also how important that
word is to the whole corpus.

Word Embedding (https://towardsdatascience.com/word2vec-explained-49c52b4ccb71)
Word embedding is a technique in which individual words are transformed into a numerical
representation (a vector).
→ The vector tries to capture various characteristics of the word with regard
to the overall text, including its semantic relationships, definition, and context.
→ With these numerical representations we can do many things, such as measuring
the similarity or dissimilarity between words.
→ In this approach, words and documents are represented as numeric vectors,
so that similar words have similar vector representations.
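
One common way to measure similarity between word vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not taken from any trained model, and real embeddings typically have tens to hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example only.
toy_vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print("king vs queen:", sim_king_queen)
print("king vs apple:", sim_king_apple)
```

In these toy vectors "king" and "queen" point in nearly the same direction, so their similarity is close to 1, while "king" and "apple" score much lower.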

Table of Contents
→ What are Pre-trained Word Embeddings?
→ Why do we need Pre-trained Word Embeddings?
→ What are the Different Pre-trained Word Embeddings?
→ Google’s Word2Vec
→ Stanford’s GloVe
→ Case Study: Learning Embeddings from Scratch vs Pre-trained Word Embeddings

What are the Different Pre-trained Word Embeddings?
Embeddings are broadly divided into 2 classes:
→ Word-level embeddings
→ Character-level embeddings
→ ELMo and Flair embeddings are examples of character-level embeddings.
Here we cover two popular word-level pre-trained word
embeddings:
→ Google’s Word2Vec
→ Stanford’s GloVe

Word2Vec
Word2Vec is one of the most popular pre-trained word embeddings, developed by
Google.
→ The pre-trained Word2Vec model was trained on the Google News dataset (about
100 billion words).
→ It has several use cases, such as recommendation engines and knowledge
discovery, and it is also applied to various text classification problems.
→ Notice that the previous methods gave us no semantic meaning for the words
of the corpus. Yet most NLP applications, such as sentiment classification and
sarcasm detection, require the semantic meaning of a word and its semantic
relationships with other words.

With Bag-of-words and TF-IDF we cannot capture the meaning of words, or the
relations between them, from the vectors.
→ Word2Vec constructs vectors that do capture them; these vectors are called embeddings.
Word2Vec learns these relationships between words while building its
model.
For this purpose Word2Vec uses two methods:
→ Skip-gram
→ CBOW (Continuous Bag of Words)

CBOW (Continuous Bag of Words)
In the CBOW model we try to predict a target word from a list of context words.
→ In CBOW, we define a window size. The middle word is the current word and
the surrounding words (past and future words) are the context. CBOW uses the
context to predict the current word. Each word is one-hot encoded over the
defined vocabulary and fed to the CBOW neural network.

The hidden layer is a standard fully-connected dense layer.
The output layer generates probabilities for the target word over the vocabulary.
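
A minimal sketch of the forward pass described above, in plain Python. The weights are random and untrained, so the probabilities are meaningless; only the shape of the computation is shown. The names (cbow_forward, w_in, w_out), the 3-dimensional hidden layer, and the averaging of the context vectors are illustrative assumptions, and the hidden layer is taken to be linear (no activation), as in the original Word2Vec design.

```python
import math
import random


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the word's vocabulary index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def matvec(vector, matrix):
    """Multiply a length-n vector by an n x m matrix, giving a length-m vector."""
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(len(matrix[0]))]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, w_in, w_out):
    """Average the one-hot context words, project to hidden, score the vocabulary."""
    vocab_size = len(w_in)
    averaged = [0.0] * vocab_size
    for idx in context_ids:
        for k, value in enumerate(one_hot(idx, vocab_size)):
            averaged[k] += value / len(context_ids)
    hidden = matvec(averaged, w_in)        # dense hidden layer
    return softmax(matvec(hidden, w_out))  # probabilities over the vocabulary


random.seed(0)
vocab = ["there", "used", "to", "be", "stone", "age"]
hidden_size = 3
w_in = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in vocab]
w_out = [[random.uniform(-0.5, 0.5) for _ in vocab] for _ in range(hidden_size)]

# Window of 2 around the current word "be": context is "used to" + "stone age".
context_ids = [vocab.index(w) for w in ["used", "to", "stone", "age"]]
probs = cbow_forward(context_ids, w_in, w_out)
print(dict(zip(vocab, probs)))
```

Training would adjust w_in and w_out so that the probability of the true middle word ("be" here) increases; after training, the rows of w_in serve as the word embeddings.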

Continuous Skip-gram model
The skip-gram model is a simple neural network with one hidden layer, trained to
predict the probability that a word appears in the context of a given input word.
→ The skip-gram model is the opposite of the CBOW model.
→ It takes the current word as input and tries to accurately predict
the words before and after it.
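
The training examples for skip-gram can be sketched as (current word, context word) pairs. The function name skipgram_pairs and the window size of 1 are illustrative assumptions; this shows only how the pairs are formed, not the network or its training.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each current word with every word within `window` positions of it."""
    pairs = []
    for i, current in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # skip the current word itself
                pairs.append((current, tokens[j]))
    return pairs


tokens = "there used to be stone age".split()
pairs = skipgram_pairs(tokens, window=1)

for current, context in pairs:
    print(current, "->", context)
```

Each pair is one training example: the network receives the current word and is asked to predict the context word. CBOW reverses the direction, using the context words as input to predict the current word.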

LINKS
https://www.mygreatlearning.com/blog/bag-of-words/ - Good for bag of words
https://www.analyticsvidhya.com/blog/2021/11/how-sklearns-tfidfvectorizer-calculates-tf-idf-values/ - Good for TF-IDF

Embedding for spelling correction
https://towardsdatascience.com/embedding-for-spelling-correction-92c93f835d79
https://dataaspirant.com/word-embedding-techniques-nlp/
Code link for embedding Technique
https://dataaspirant.com/word-embedding-techniques-nlp/
Pre trained model detailed explanation
https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/#:~:text=Let's%20answer%20the%20big%20question,used%20for%20solving%20other%20tasks.
