Vectorization In NLP
By Chode Amarnath
Important links
→ https://www.turing.com/kb/guide-on-word-embeddings-in-nlp
USES
→ Bag of words: Extracts features from the text
→ TF-IDF: Information retrieval, keyword extraction
→ Word2Vec: Semantic analysis task
→ GloVe: Word analogy, named entity recognition tasks
→ BERT: language translation, question answering system
Index
→ Vectorization Techniques
→ Word Embedding Pre-trained Methods
Vectorization
The process of converting words into numbers is called vectorization.
→ It is easy for us to understand a sentence because we know the
semantics of its words. But how can a program (e.g. one written in Python)
interpret the sentence?
→ Programs work with textual data far more easily when it is in
numerical form. For this reason, we vectorize the text so that a model can
process it.
To convert string data into numerical data, one can use the following techniques.
1. Bag of words.
2. TF-IDF.
3. Word2Vec
What are N-grams and why do we use them?
An N-gram is a sequence of N consecutive tokens (words): a 2-gram (more commonly
called a bigram) is a two-word sequence like “really good”, “not good”, or “your
homework”, and a 3-gram (more commonly called a trigram) is a three-word sequence
like “not at all” or “turn off light”.
A “bag of bigrams” is often more powerful than a plain “bag of words” because it keeps
some local word order (“not good” stays distinct from “good”), and in many cases it is
very hard to beat.
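As a minimal sketch of the definition above, the helper below slides a window of N tokens over a sentence. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration, not part of any library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is a simplifying assumption; real tokenizers do more.
tokens = "There used to be Stone Age".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of six tokens yields five bigrams and four trigrams, since each window must fit entirely inside the sentence.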
Bag of Words
It is a basic model used in natural language processing.
→ It is called a bag of words because the order of the words in the
document is discarded; the model only records which words occur in the document and how often.
Eg:
“There used to be Stone Age”
“There used to be Bronze Age”
“There used to be Iron Age”
“There was Age of Revolution”
“Now it is Digital Age”
Here each sentence is a separate document. If we make a list of words in which each
word occurs only once, our list is:
“There”, “was”, “to”, “be”, “used”, “Stone”, “Bronze”, “Iron”, “Revolution”, “Digital”, “Age”, “of”, “Now”, “it”, “is”
We then count the occurrences of each word of this list in a document. For example, the
sentence “There used to be Stone Age” can be represented as:
“There used to be Stone Age” = [1,0,1,1,1,1,0,0,0,0,1,0,0,0,0]
So here we convert a document into a vector.
By following the same approach, the other vectors are as follows:
“There used to be Bronze Age” = [1,0,1,1,1,0,1,0,0,0,1,0,0,0,0]
“There used to be Iron Age” = [1,0,1,1,1,0,0,1,0,0,1,0,0,0,0]
“There was Age of Revolution” = [1,1,0,0,0,0,0,0,1,0,1,1,0,0,0]
“Now it is Digital Age” = [0,0,0,0,0,0,0,0,0,1,1,0,1,1,1]
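The vectors above can be reproduced with a few lines of plain Python. This is only a sketch: the helper name `bag_of_words` is hypothetical, and it assumes case-sensitive matching on whitespace-separated tokens, using the word list in the order given earlier.

```python
# Word list in the same order as in the example above.
vocabulary = ["There", "was", "to", "be", "used", "Stone", "Bronze", "Iron",
              "Revolution", "Digital", "Age", "of", "Now", "it", "is"]

documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    tokens = document.split()
    return [tokens.count(word) for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in documents]
for doc, vec in zip(documents, vectors):
    print(doc, "=", vec)
```

Each position in a vector corresponds to one word of the list, so word order inside the sentence plays no role in the result.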
The approach discussed above uses unigrams because we consider only one word at a
time. Similarly, we have bigrams (two words at a time, for example: There used, used to,
to be, be Stone, Stone Age), trigrams (three words at a time, for example: There used to,
used to be, to be Stone, be Stone Age), and n-grams (n words at a time).
Using scikit-learn's CountVectorizer class, we can convert text documents into a matrix of
word counts. The matrix produced is a sparse matrix.
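To see why the count matrix is stored sparsely, the sketch below builds one by hand and keeps only the non-zero cells. It is not CountVectorizer's implementation: the dictionary layout `{(row, column): count}` and the case-sensitive whitespace tokenization are assumptions chosen for illustration, and a real library applies its own tokenization rules.

```python
from collections import Counter

documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

# Map every distinct token to a column index (sorted for a stable order).
all_words = sorted({word for doc in documents for word in doc.split()})
vocabulary = {word: column for column, word in enumerate(all_words)}

# Sparse storage: only non-zero counts are kept.
sparse_counts = {}
for row, doc in enumerate(documents):
    for word, count in Counter(doc.split()).items():
        sparse_counts[(row, vocabulary[word])] = count

total_cells = len(documents) * len(vocabulary)
print(len(sparse_counts), "non-zero entries out of", total_cells, "cells")
```

Even in this tiny corpus most cells are zero, and the proportion of zeros grows quickly as the vocabulary gets larger.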
CountVectorizer
Converts a collection of text documents to a matrix of token counts.
Its major disadvantages are:
→ It cannot distinguish more important words from less important ones.
→ It simply treats the words that are most abundant in a corpus as the most
statistically significant.
→ It does not capture relationships between words, such as linguistic
similarity.
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how
important a word is to a document within a corpus or dataset.
→ TF-IDF combines two concepts: Term Frequency (TF) and Inverse Document
Frequency (IDF).
Term Frequency
This measures how frequently a word occurs in a document.
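One common way to define term frequency is the number of times a word occurs divided by the total number of words in the document. The sketch below assumes that definition (other variants exist, such as raw counts) and uses a hypothetical helper named `term_frequency`.

```python
def term_frequency(term, document):
    """Occurrences of `term` divided by the number of tokens in the document."""
    tokens = document.split()
    return tokens.count(term) / len(tokens)

tf_age = term_frequency("Age", "There used to be Stone Age")
tf_iron = term_frequency("Iron", "There used to be Stone Age")

print(tf_age)
print(tf_iron)
```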
Inverse Document Frequency
Inverse document frequency is the second concept used to find the importance
of a word.
→ It is based on the idea that words appearing in fewer documents are more
informative and important.
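A minimal sketch of this idea, assuming the textbook form IDF = log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries often use smoothed variants, so their values may differ; the helper name `inverse_document_frequency` is hypothetical.

```python
import math

documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

def inverse_document_frequency(term, documents):
    """log(N / df); assumes `term` occurs in at least one document."""
    containing = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / containing)

idf_age = inverse_document_frequency("Age", documents)      # in every document
idf_stone = inverse_document_frequency("Stone", documents)  # in one document

print(idf_age, idf_stone)
```

“Age” appears in all five documents, so it carries no distinguishing information and its IDF is zero, while the rarer “Stone” gets a higher value.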
TF-IDF
The TF-IDF score of a word is the product of its TF and its IDF, so it reduces the weight
of common words that appear in many documents.
Difference Between CountVectorizer and TF-IDF
→ The TF-IDF vectorizer and CountVectorizer are both methods used in natural
language processing to vectorize text. However, there is a fundamental difference
between the two methods.
→ CountVectorizer simply counts the number of times a word appears in a
document (using a bag-of-words approach), while the TF-IDF vectorizer takes into account
not only how many times a word appears in a document but also how important that
word is to the whole corpus.
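The difference can be seen by scoring the same document both ways. This is a sketch under stated assumptions, not scikit-learn's implementation: it uses TF = count / document length and IDF = log(N / df) without smoothing, and the helper names are hypothetical.

```python
import math

documents = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def count_vector(tokens):
    """Raw occurrence count for every vocabulary word."""
    return [tokens.count(word) for word in vocabulary]

def tfidf_vector(tokens):
    """TF * IDF for every vocabulary word (unsmoothed, for illustration)."""
    vector = []
    for word in vocabulary:
        tf = tokens.count(word) / len(tokens)
        df = sum(1 for other in tokenized if word in other)
        vector.append(tf * math.log(len(tokenized) / df))
    return vector

counts = dict(zip(vocabulary, count_vector(tokenized[0])))
tfidf = dict(zip(vocabulary, tfidf_vector(tokenized[0])))

for word in tokenized[0]:
    print(f"{word:6} count={counts[word]} tfidf={tfidf[word]:.3f}")
```

In the first document every word has a count of 1, but under TF-IDF “Stone” (unique to this document) scores highest and “Age” (present in every document) drops to zero.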
Word Embedding (https://towardsdatascience.com/word2vec-explained-49c52b4ccb71)
Word embedding is a technique in which each word is transformed into a numerical
representation (a vector).
→ The vector tries to capture various characteristics of the word with regard
to the overall text. These characteristics include the word's semantic relationships,
definition, context, etc.
→ With these numerical representations we can do many things, such as identifying
similarity or dissimilarity between words.
→ In this approach, words and documents are represented as numeric vectors,
so that similar words have similar vector representations.
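Similarity between word vectors is commonly measured with cosine similarity. The sketch below uses three made-up 3-dimensional vectors purely for illustration; they are not taken from any trained model, and real embeddings usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example only.
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])

print(round(sim_king_queen, 3), round(sim_king_apple, 3))
```

Vectors that point in nearly the same direction score close to 1, which is how "similar words have similar vectors" is checked in practice.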
Table of Contents
→ What are Pre-trained Word Embeddings?
→ Why do we need Pre-trained Word Embeddings?
→ What are the Different Pre-trained Word Embeddings?
→ Google’s Word2Vec
→ Stanford’s GloVe
→ Case Study: Learning Embeddings from Scratch vs Pre-trained Word Embeddings
What are the Different Pre-trained Word Embeddings?
Embeddings are divided into 2 classes:
→ Word-level embeddings
→ Character-level embeddings
→ ELMo and Flair embeddings are examples of character-level embeddings.
Here, we cover two popular word-level pre-trained word
embeddings:
→ Google’s Word2Vec
→ Stanford’s GloVe
Word2Vec
Word2Vec is one of the most popular pre-trained word embeddings, developed by
Google.
→ The pre-trained Word2Vec model is trained on the Google News dataset (about 100 billion words).
→ It has several use cases, such as recommendation engines and knowledge
discovery, and it is also applied to various text classification problems.
→ Did you notice that the previous methods gave us no semantic meaning for the
words of the corpus? Yet most NLP applications, such as sentiment classification and
sarcasm detection, require the semantic meaning of a word and its semantic
relationships with other words.
With the Bag-of-words and TF-IDF techniques, we cannot capture the meaning of words or
the relations between them from the vectors.
→ Word2vec constructs vectors that do capture them; these vectors are called embeddings.
The Word2vec method learns these relationships between words while building a
model.
For this purpose, word2vec uses two methods. They are:
→ Skip-gram
→ CBOW (Continuous Bag of Words)
CBOW (Continuous Bag of Words)
In the CBOW model, we try to predict a target word from a list of context words.
→ In CBOW, we define a window size. The middle word is the current word, and
the surrounding words (past and future words) are the context. CBOW uses the
context to predict the current word. Each word is encoded using one-hot encoding over
the defined vocabulary and sent to the CBOW neural network.
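The windowing and one-hot encoding described above can be sketched as follows. The helper names `cbow_pairs` and `one_hot`, the window size of 2, and the whitespace tokenization are all assumptions made for illustration.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

def cbow_pairs(tokens, window):
    """For each position, pair the surrounding context words with the target."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

tokens = "There used to be Stone Age".split()
vocabulary = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

pairs = cbow_pairs(tokens, window=2)
context, target = pairs[2]  # the pair whose target is "to"
encoded_context = [one_hot(word_to_index[w], len(vocabulary)) for w in context]

print(context, "->", target)
print(encoded_context)
```

Words near the start or end of the sentence simply get a shorter context, since the window is cut off at the sentence boundary.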
The hidden layer is a standard fully-connected dense layer.
The output layer generates probabilities for the target word over the vocabulary.
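A rough sketch of one forward pass through such a network is shown below. It assumes the hidden layer has no activation function and averages the context words' rows of the input weight matrix (equivalent to feeding the averaged one-hot vectors through a dense layer), followed by a softmax output. The weights are random and untrained, so the probabilities are meaningless; only the shapes and the flow of data are illustrated. All names and sizes are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so repeated runs give the same numbers

vocabulary = ["Age", "Stone", "There", "be", "to", "used"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}
vocab_size, hidden_size = len(vocabulary), 4

# Untrained weights: input-to-hidden (V x H) and hidden-to-output (H x V).
w_in = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        for _ in range(vocab_size)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)]
         for _ in range(hidden_size)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_words):
    """Average the context rows of w_in, project to the vocabulary, softmax."""
    rows = [w_in[word_to_index[w]] for w in context_words]
    hidden = [sum(column) / len(rows) for column in zip(*rows)]
    scores = [sum(h * w_out[j][k] for j, h in enumerate(hidden))
              for k in range(vocab_size)]
    return softmax(scores)

probabilities = cbow_forward(["There", "used", "be", "Stone"])
for word, p in zip(vocabulary, probabilities):
    print(f"{word:6} {p:.3f}")
```

After training, the rows of the input weight matrix are what is typically kept as the word embeddings.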
Continuous Skip-gram model
The skip-gram model is a simple neural network with one hidden layer, trained to
predict the probability of a given word appearing in the context of an input word.
→ The skip-gram model is the opposite of the CBOW model.
→ It takes the current word as input and tries to accurately predict
the words before and after it.
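The training examples for skip-gram can be sketched as (current word, context word) pairs. The helper name `skipgram_pairs` and the window size of 1 are assumptions chosen to keep the output short.

```python
def skipgram_pairs(tokens, window):
    """Pair each word with every neighbour within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "There used to be Stone Age".split()
pairs = skipgram_pairs(tokens, window=1)

for center, context_word in pairs:
    print(center, "->", context_word)
```

Compared with CBOW, the direction is reversed: the input is the single current word and each neighbour becomes a separate prediction target.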
LINKS
https://www.mygreatlearning.com/blog/bag-of-words/ - Good for bag of words
https://www.analyticsvidhya.com/blog/2021/11/how-sklearns-tfidfvectorizer-calculates-tf-idf-values/ - Good for TF-IDF
Embedding for spelling correction
https://towardsdatascience.com/embedding-for-spelling-correction-92c93f835d79
https://dataaspirant.com/word-embedding-techniques-nlp/
Code link for embedding Technique
https://dataaspirant.com/word-embedding-techniques-nlp/
Pre trained model detailed explanation
https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/
