Vectorization In NLP
By Chode Amarnath
Important links
→ https://www.turing.com/kb/guide-on-word-embeddings-in-nlp
USES
→ Bag of Words: Extracts features from the text
→ TF-IDF: Information retrieval, keyword extraction
→ Word2Vec: Semantic analysis tasks
→ GloVe: Word analogy, named entity recognition tasks
→ BERT: Language translation, question answering systems
Index
→ Vectorization Techniques
→ Word Embedding Pre-trained Methods
Vectorization
The process of converting words into numbers is called vectorization.
→ It is easy for us to understand a sentence because we know the semantics of its words. But how can a program (e.g. in Python) interpret that sentence?
→ Programs handle numerical data far more easily than raw text, so we vectorize the text to give it a representation a model can work with.
To convert string data into numerical data, one can use the following techniques:
1. Bag of Words
2. TF-IDF
3. Word2Vec
What are N-grams and why do we use them?
An N-gram is a sequence of N tokens: a 2-gram (more commonly called a bigram) is a two-word sequence such as “really good”, “not good”, or “your homework”, and a 3-gram (more commonly called a trigram) is a three-word sequence such as “not at all” or “turn off light”.
A “bag of bigrams” is more powerful than a plain “bag of words”, and in many cases it is very hard to beat.
Bag of Words
It is a basic model used in natural language processing.
→ It is called a bag of words because the order of the words in the document is discarded; the representation only records whether (and how often) each word occurs in the document.
Eg:
“There used to be Stone Age”
“There used to be Bronze Age”
“There used to be Iron Age”
“There was Age of Revolution”
“Now it is Digital Age”
Here each sentence is a separate document. If we build a list of words such that each word occurs only once, the list (our vocabulary) is:
“There”, “was”, “to”, “be”, “used”, “Stone”, “Bronze”, “Iron”, “Revolution”, “Digital”, “Age”, “of”, “Now”, “it”, “is”
We then count the occurrences of each word in a document with respect to this list. For example, the sentence “There used to be Stone Age” becomes:
“There used to be Stone Age” = [1,0,1,1,1,1,0,0,0,0,1,0,0,0,0]
So each document is converted into a vector.
Following the same approach, the other vectors are:
“There used to be bronze age” = [1,0,1,1,1,0,1,0,0,0,1,0,0,0,0]
“There used to be iron age” = [1,0,1,1,1,0,0,1,0,0,1,0,0,0,0]
“There was age of revolution” = [1,1,0,0,0,0,0,0,1,0,1,1,0,0,0]
“Now it is Digital Age” = [0,0,0,0,0,0,0,0,0,1,1,0,1,1,1]
The approach discussed above is a unigram model because we consider only one word at a time. Similarly, we have bigrams (two words at a time, e.g. “There used”, “used to”, “to be”, “be Stone”, “Stone Age”), trigrams (three words at a time, e.g. “There used to”, “used to be”, “to be Stone”, “be Stone Age”), and in general n-grams (n words at a time), as sketched below.
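A small sketch, in plain Python with no external library, of how these unigram/bigram/trigram sequences can be produced from a tokenized sentence:

```python
# Minimal n-gram extraction from a whitespace-tokenized sentence.
def ngrams(tokens, n):
    """Return the n-token sequences of `tokens`, in order of appearance."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "There used to be Stone Age".split()
print(ngrams(tokens, 2))  # ['There used', 'used to', 'to be', 'be Stone', 'Stone Age']
print(ngrams(tokens, 3))  # ['There used to', 'used to be', 'to be Stone', 'be Stone Age']
```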
Using the CountVectorizer function we can convert text documents into a matrix of word counts. The matrix produced is a sparse matrix.
CountVectorizer
Convert a collection of text documents to a matrix of token counts.
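A minimal sketch of CountVectorizer on the example sentences above, assuming scikit-learn is installed. Note that CountVectorizer lowercases the text and orders its vocabulary alphabetically, so the columns will not line up exactly with the hand-built word list; passing ngram_range=(2, 2) would give the “bag of bigrams” mentioned earlier.

```python
# Bag-of-words counts with scikit-learn's CountVectorizer (sparse matrix output).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

vectorizer = CountVectorizer()              # unigram counts by default
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # learned vocabulary (lowercased, sorted)
print(X.toarray())                          # dense view: one row per document
```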
But its major disadvantages are:
→ It cannot distinguish more important words from less important ones for analysis.
→ It simply treats the words that are most abundant in the corpus as the most statistically significant.
→ It does not capture relationships between words, such as linguistic similarity.
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency, and it measures the importance of a word in the corpus or dataset.
→ TF-IDF combines two concepts: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency
This measures how frequently a word occurs in a document.
Inverse Document Frequency
Inverse document frequency measures how informative a word is across the corpus.
→ It is based on the observation that less frequent words are more informative and important.
TF-IDF
It reduces the weight of common words that appear across many documents.
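In the classic formulation, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. Below is a minimal sketch on the same example documents, assuming scikit-learn; note that TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so the exact numbers differ from the textbook formula.

```python
# TF-IDF weights with scikit-learn's TfidfVectorizer. The word "age", which
# appears in every document, gets a lower weight than rarer words like "stone".
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "There used to be Stone Age",
    "There used to be Bronze Age",
    "There used to be Iron Age",
    "There was Age of Revolution",
    "Now it is Digital Age",
]

tfidf = TfidfVectorizer()          # smoothed idf + L2 row normalization by default
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))        # compare the "age" column with "stone"/"bronze"/"iron"
```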
Difference Between Count Vectorizer and TF-IDF
→ TfidfVectorizer and CountVectorizer are both methods used in natural language processing to vectorize text, but there is a fundamental difference between the two.
→ CountVectorizer simply counts the number of times a word appears in a document (a bag-of-words approach), while TfidfVectorizer takes into account not only how many times a word appears in a document but also how important that word is to the whole corpus.
Word Embedding (https://towardsdatascience.com/word2vec-explained-49c52b4ccb71)
Word embedding is a technique where individual words are transformed into a numerical representation (a vector).
→ The vector tries to capture various characteristics of that word with regard to the overall text, such as its semantic relationships, definition, and context.
→ With these numerical representations we can do many things, such as identifying similarity or dissimilarity between words (see the sketch below).
→ In this approach, words and documents are represented as numeric vectors, so that similar words have similar vector representations.
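As a small sketch of what measuring similarity between word vectors looks like in practice, here is cosine similarity computed with NumPy; the three-dimensional vectors below are made up purely for illustration and are not taken from any real embedding model.

```python
# Cosine similarity between word vectors (toy vectors, illustrative only).
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; values near 1.0 mean very similar direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

king  = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.1])
apple = np.array([0.1, 0.1, 0.9])

print(cosine_similarity(king, queen))   # high: related words point in similar directions
print(cosine_similarity(king, apple))   # low: unrelated words point in different directions
```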
Table of Contents
→ What are Pre-trained Word Embeddings?
→ Why do we need Pre-trained Word Embeddings?
→ What are the Different Pre-trained Word Embeddings?
→ Google’s Word2Vec
→ Stanford’s GloVe
→ Case Study: Learning Embeddings from Scratch vs Pre-trained Word Embeddings
What are the Different Pre-trained Word Embeddings?
Embeddings are divided into 2 classes:
→ Word-level embeddings
→ Character-level embeddings
→ ELMo and Flair embeddings are examples of character-level embeddings.
Here we cover two popular word-level pre-trained word embeddings:
→ Google’s Word2Vec
→ Stanford’s GloVe
Word2Vec
Word2Vec is one of the most popular pre-trained word embeddings, developed by Google.
→ Word2Vec is trained on the Google News dataset (about 100 billion words).
→ It has several use cases, such as recommendation engines and knowledge discovery, and it is also applied to various text classification problems.
→ Notice that the previous methods did not give us any semantic meaning for the words of the corpus. Yet most NLP tasks, such as sentiment classification and sarcasm detection, require the semantic meaning of a word and its semantic relationships with other words.
Bag-of-words and TF-IDF vectors cannot capture the meaning of words or the relationships between them.
→ Word2Vec constructs vectors that do, called embeddings.
The Word2Vec method learns these relationships between words while building its model.
For this purpose Word2Vec uses two methods (a short training sketch follows this list):
→ Skip-gram
→ CBOW (Continuous Bag of Words)
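A minimal training sketch, assuming the gensim library is installed: sg=0 selects CBOW and sg=1 selects Skip-gram. On a toy corpus this small the resulting vectors are meaningless, so treat it only as an illustration of the API; real models are trained on millions of sentences.

```python
# Training Word2Vec embeddings with gensim on a tiny toy corpus.
from gensim.models import Word2Vec

sentences = [
    ["there", "used", "to", "be", "stone", "age"],
    ["there", "used", "to", "be", "bronze", "age"],
    ["there", "used", "to", "be", "iron", "age"],
    ["there", "was", "age", "of", "revolution"],
    ["now", "it", "is", "digital", "age"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # Skip-gram

print(cbow.wv["age"][:5])                 # first few dimensions of one embedding
print(skipgram.wv.most_similar("stone"))  # nearest neighbours in the toy vector space
```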
CBOW (Continuous Bag of Words)
The CBOW model essentially tries to predict a target word from a list of context words.
→ In CBOW, we define a window size. The middle word is the current word and the surrounding words (past and future words) are the context. CBOW uses the context to predict the current word. Each word is one-hot encoded over the defined vocabulary and fed to the CBOW neural network.
The hidden layer is a standard fully-connected dense layer.
The output layer generates probabilities for the target word from the vocabulary.
Continuous Skip-gram Model
The Skip-gram model is a simple neural network with one hidden layer, trained to predict the probability of the surrounding words given the current word.
→ The Skip-gram model is the opposite of the CBOW model.
→ It takes the current word as input and tries to accurately predict the words before and after it.
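Either architecture ends up producing the same kind of artefact: a lookup table of word vectors. Below is a sketch of loading and querying the pretrained Google News Word2Vec model mentioned earlier, via gensim's downloader (assumes gensim is installed and an internet connection; the first call downloads well over a gigabyte of data).

```python
# Loading the pretrained 300-dimensional Google News Word2Vec vectors via gensim.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # KeyedVectors trained on ~100B words

print(wv.most_similar("king", topn=3))      # nearest neighbours of "king"
print(wv.similarity("good", "great"))       # cosine similarity between two words
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))  # word analogy
```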
LINKS
https://www.mygreatlearning.com/blog/bag-of-words/ - Good for bag of words
https://www.analyticsvidhya.com/blog/2021/11/how-sklearns-tfidfvectorizer-calculates-tf-idf-values/ - good for TF-IDF
Embedding for spelling correction
https://towardsdatascience.com/embedding-for-spelling-correction-92c93f835d79
https://dataaspirant.com/word-embedding-techniques-nlp/
Code link for embedding techniques
https://dataaspirant.com/word-embedding-techniques-nlp/
Pre-trained model detailed explanation
https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/#:~:text=Let's%20answer%20the%20big%20question,used%20for%20solving%20other%20tasks.

  • 39. Embedding for spelling correction https://towardsdatascience.com/embedding-for-spelling-correction-92c93f835d79 https://dataaspirant.com/word-embedding-techniques-nlp/ Code link for embedding Technique https://dataaspirant.com/word-embedding-techniques-nlp/ Pre trained model detailed explanation https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings- nlp/#:~:text=Let's%20answer%20the%20big%20question,used%20for%20solving%20other%20tasks.