2. Outline
Introduction.
What are word embeddings?
Why do we need word embeddings?
Different types of word embeddings.
Word embedding tools.
Word2vec.
Word embedding tutorial.
4. Introduction
We need a representation for words that captures their meanings, their semantic relationships, and the different kinds of contexts they are used in. That representation is a word embedding.
5. What are Word Embeddings?
A set of language modeling and feature learning techniques in natural language processing (NLP) in which words are mapped to vectors of real numbers.
Words that have the same meaning have similar word embedding representations.
7. Why do we need word embeddings?
The need for unsupervised learning.
They serve many NLP applications, not just one task.
Many machine learning algorithms (including deep nets) require their input to be vectors of continuous values; they just won’t work on strings of plain text (cat, cats, dog, ...).
Vector representation has two important and advantageous properties:
Dimensionality Reduction.
Contextual Similarity.
8. Example of a Simple Method
sentence = ”Word Embeddings are Word converted into numbers”
dictionary = [‘Word’, ’Embeddings’, ’are’, ’Converted’, ’into’, ’numbers’]
The one-hot encoded vector representation of “numbers” according to the above dictionary is:
“numbers” [0, 0, 0, 0, 0, 1]
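As a minimal sketch (plain Python, not part of the original slides), the one-hot vector above could be produced like this:
dictionary = ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers']

def one_hot(word, dictionary):
    # A vector of zeros with a single 1 at the word's index in the dictionary.
    vector = [0] * len(dictionary)
    vector[dictionary.index(word)] = 1
    return vector

print(one_hot('numbers', dictionary))  # [0, 0, 0, 0, 0, 1]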
9. Different Types of Word Embeddings
1. Frequency-based Embedding:
Count Vector
TF-IDF Vector
Co-Occurrence Vector
2. Prediction-based Embedding:
CBOW (Continuous Bag of Words)
Skip-Gram model
11. Different Types of Word Embeddings: TF-IDF Vector
• Takes into account the entire corpus.
• Penalizes common words by assigning them lower weights.
• TF = (number of times term t appears in a document) / (number of terms in the document).
• IDF = log(N/n), where N is the number of documents and n is the number of documents in which term t appears.
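These two formulas can be sketched in plain Python as follows (the toy documents below are hypothetical and only meant to illustrate the computation):
import math

documents = [
    'the sky is blue'.split(),
    'the sun is bright'.split(),
]

def tf(term, document):
    # TF = (number of times term t appears in the document) / (number of terms in the document)
    return document.count(term) / len(document)

def idf(term, documents):
    # IDF = log(N/n): N = number of documents, n = number of documents containing term t
    n = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n)

print(tf('the', documents[0]) * idf('the', documents))  # common word, weight 0.0
print(tf('sky', documents[0]) * idf('sky', documents))  # rarer word, higher weight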
12. Different Types of Word Embeddings: Co-Occurrence Vector
• The idea is that similar words tend to occur together and will have similar contexts.
• Co-occurrence is the number of times two words have appeared together within a context window.
• The context window is specified by a number (its size) and a direction.
13. Co-Occurrence Vector Example
Corpus: مدينة الرياض عاصمة السعودية مدينة الرياض متطورة مدينة الرياض مزدحمة
(Roughly: “The city of Riyadh is the capital of Saudi Arabia, the city of Riyadh is developed, the city of Riyadh is crowded.”)
The 2 (around) context window for the word ‘عاصمة’ covers the two words on either side of it. Let us calculate a co-occurrence matrix whose rows and columns are the six words: مدينة، الرياض، عاصمة، السعودية، متطورة، مزدحمة.
14. Co-Occurrence Vector Example
Sliding the size-2 window over the corpus مدينة الرياض عاصمة السعودية مدينة الرياض متطورة مدينة الرياض مزدحمة, the pair (مدينة، الرياض) falls inside the window four times.
So the co-occurrence of the two words (مدينة، الرياض) is 4.
15. Co-Occurrence Vector Example
The co-occurrence matrix is not the word vector representation that is generally used. Instead, it is decomposed using techniques like PCA, SVD, etc. into factors, and the combination of these factors forms the word vector representation.
Co-occurrence counts with the word مدينة (the first column of the matrix):
مدينة: 0
الرياض: 4
عاصمة: 2
السعودية: 1
متطورة: 2
مزدحمة: 1
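A minimal NumPy sketch of building this co-occurrence matrix and decomposing it (SVD is only one of the techniques mentioned above, and the 2-dimensional cut-off is an arbitrary choice for illustration):
import numpy as np

corpus = 'مدينة الرياض عاصمة السعودية مدينة الرياض متطورة مدينة الرياض مزدحمة'.split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
window = 2  # the "2 (around)" context window from the example

# Count how often each pair of words appears within the window.
M = np.zeros((len(vocab), len(vocab)))
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            M[index[word], index[corpus[j]]] += 1

print(M[index['مدينة'], index['الرياض']])  # 4.0, as in the example

# The matrix itself is rarely used directly; it is decomposed and the
# leading factors are kept as the word vectors.
U, S, Vt = np.linalg.svd(M)
word_vectors = U[:, :2] * S[:2]  # keep 2 dimensions for illustration
print(dict(zip(vocab, word_vectors.round(2))))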
16. Different Types of Word Embeddings: Prediction-based Embedding
• Introduced in word2vec.
• Both CBOW and Skip-Gram are shallow neural networks (NN) with three layers.
• The neural networks map word(s) to a target variable which is also a word (or words).
• Both techniques learn weights which act as the word vector representations.
17. Different Types of Word Embeddings: CBOW
• The CBOW NN predicts the probability of a word given its context.
19. CBOW
Example one-hot inputs (vocabulary size V = 10):
The word ‘Sample’: 0 0 0 1 0 0 0 0 0 0
The word ‘Corpus’: 0 0 0 0 1 0 0 0 0 0
N is the number of neurons in the hidden layer, which equals the number of dimensions we choose to represent our words.
The weights between the hidden layer and the output layer are taken as the word vector representation of the word.
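A rough NumPy sketch of this architecture with a single context word (V = 10 as above, N = 4 chosen arbitrarily; this is an illustration, not the actual word2vec implementation):
import numpy as np

V, N = 10, 4                      # vocabulary size and hidden-layer size
W_in = np.random.rand(V, N)       # input -> hidden weights
W_out = np.random.rand(N, V)      # hidden -> output weights

x = np.zeros(V)
x[3] = 1.0                        # one-hot input, e.g. the word 'Sample'

h = x @ W_in                      # hidden layer: effectively selects row 3 of W_in
scores = h @ W_out                # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax: probability of each target word
print(probs.round(3), probs.sum())             # probabilities sum to 1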
21. Different Types of Word Embeddings: Skip-Gram
• The aim of Skip-Gram is to predict the context given a word.
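For instance, the (center word, context word) training pairs that Skip-Gram learns from can be generated like this (a sketch with a made-up sentence and window size):
sentence = 'word embeddings are words converted into numbers'.split()
window = 2

# Each pair is one training example: predict the context word given the center word.
pairs = [(sentence[i], sentence[j])
         for i in range(len(sentence))
         for j in range(max(0, i - window), min(len(sentence), i + window + 1))
         if i != j]
print(pairs[:4])  # [('word', 'embeddings'), ('word', 'are'), ('embeddings', 'word'), ('embeddings', 'are')]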
23. CBOW and Skip-Gram
In simpler words, CBOW tends to find the probability of a word occurring in a neighbourhood (context), so it generalises over all the different contexts in which a word can be used.
Skip-Gram, on the other hand, tends to learn the different contexts separately, so it needs enough data for each context. Hence Skip-Gram requires more data to train, but (given enough data) it captures more knowledge about the contexts.
24. Word Embedding Tools
The most famous algorithms used to build word embeddings:
Word2vec: CBOW and Skip-Gram. Created by a team of researchers led by Tomas Mikolov at Google (Mikolov et al., 2013).
GloVe: based on co-occurrence statistics. Developed as an open-source project at Stanford University (Pennington et al., 2014).
fastText: BOW + subword information; based on the skip-gram model, where each word is represented as a bag of character n-grams. Created by Facebook's AI Research (FAIR) lab (2018).
26. Word Embedding Tools
• Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. Gensim includes implementations of word2vec, doc2vec, and other algorithms.
• Eclipse Deeplearning4j is a deep learning programming library written for Java. Deeplearning4j includes implementations of word2vec, doc2vec, GloVe, and more.
28. Word2Vec Tutorial
The Gensim library provides an easy way to use Word2Vec in Python.
We will use the Classical Arabic Corpus compiled by Maha Alrabiah.
You will apply the following steps:
Train your own word2vec model on an Arabic corpus.
Save your model.
Print word similarity scores.
Get the word vector of a word.
Pick the odd word out.
Load a pre-trained model.
29. Word2Vec Tutorial
Use the following Python code for each step:
1. Import the needed modules:
import gensim
from gensim.models import word2vec
import logging
2. Set up logging:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
30. Word2Vec Tutorial
3. Load your corpus:
sentences = word2vec.Text8Corpus('…/your_data')
4. Train the model:
model = word2vec.Word2Vec(sentences, size=300, sg=0)
size: Dimensionality of the feature vectors.
sg: Defines the training algorithm. If 1, skip-gram is used; otherwise (0), CBOW is employed.
window: The maximum distance between the current and predicted word within a sentence.
min_count: Ignores all words with total frequency lower than this.
…
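Note: on newer Gensim releases (4.x) the size parameter was renamed to vector_size, so on a recent installation the same call would look like this:
# Gensim 4.x: size was renamed to vector_size; sg, window and min_count keep the same meaning.
model = word2vec.Word2Vec(sentences, vector_size=300, sg=0)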
31. Word2Vec Tutorial
Use the following Python code for each step:
5. Save the trained model:
model.save('…/workshop.model')
model.wv.save_word2vec_format('…/workshop.bin', binary=True)
6. Print word similarity scores:
most_similar = model.wv.most_similar('السماء')
for term, score in most_similar:
    print(term, score)
model.wv.similarity('word1', 'word2')
32. Word2Vec Tutorial
Use the following Python code for each step:
7. Get the word vector of a word:
print(model.wv['محمد'])
8. Pick the odd word out:
print(model.wv.doesnt_match("السماوات السبع الجبال الارض مكه".split()))
9. Load a pre-trained model:
model = gensim.models.KeyedVectors.load_word2vec_format('…/workshop.bin', binary=True)
We use natural language applications, or benefit from them, every day.
Natural Language Processing (NLP) helps machines “read” text by simulating the human ability to understand language.
The challenge with machine translation technologies is not in translating words, but in understanding the meaning of sentences to provide a true translation.
It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems, such as part-of-speech tagging, information retrieval, and question answering.
One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.
Dimensionality Reduction: it is a more efficient representation.
Contextual Similarity: it is a more expressive representation.
Page 92, Neural Network Methods in Natural Language Processing, 2017.
In the first type, the statistics of how often a word co-occurs with its neighboring words are computed, and these statistics are then mapped down to a vector for each word.
Predictive models, however, take 'raw' text as input and learn a word's representation by predicting its surrounding context (the case of the skip-gram model) or predicting a word given its surrounding context (the case of the Continuous Bag of Words (CBOW) model), using gradient descent with randomly initialized vectors.
There may be quite a few variations in how a count vector is built:
The way the dictionary is prepared (e.g., keeping only the top 10,000 words).
The way the count is taken for each word (raw frequency or just presence).
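As a sketch, both variations map directly onto parameters of scikit-learn's CountVectorizer (scikit-learn is an assumption here; it is not mentioned in the original notes):
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Word Embeddings are Word converted into numbers']

# max_features limits the dictionary (e.g. the top 10,000 words);
# binary=True records presence instead of frequency.
counts   = CountVectorizer(max_features=10000).fit_transform(docs)
presence = CountVectorizer(max_features=10000, binary=True).fit_transform(docs)
print(counts.toarray())    # 'word' occurs twice -> count 2
print(presence.toarray())  # presence only -> every entry is 0 or 1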
This method computes the statistics of word co-occurrences with neighboring words and then maps these statistics down to a vector for each word.
The input layer and the target are both one-hot encoded vectors of size [1 x V].
With multiple context words, the input can be thought of as several one-hot encoded vectors.
The calculation of the hidden activation changes: instead of just copying the corresponding row of the input-hidden weight matrix to the hidden layer, an average is taken over all the corresponding rows of the matrix.
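A small NumPy sketch of that averaging step (the sizes and indices are made up for illustration):
import numpy as np

V, N = 10, 4
W_in = np.random.rand(V, N)      # input -> hidden weight matrix

context_indices = [3, 4]         # e.g. the one-hot positions of two context words
# With several context words, the hidden activation is the average of the
# corresponding rows of the input-hidden matrix rather than a single copied row.
h = W_in[context_indices].mean(axis=0)
print(h.shape)                   # (4,) -> one N-dimensional hidden activation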
It just flips CBOW’s architecture on its head.
A sum is taken over all the error vectors to obtain a final error vector.