2. Outline
Introduction.
What are word embeddings?
Why do we need word embeddings?
Different types of word embeddings.
Word embedding tools.
Word2vec.
Word embedding tutorial.
4. Introduction
We need a representation for words that captures their meanings, their semantic relationships, and the different kinds of contexts they are used in. That representation is a word embedding.
5. What are Word Embeddings?
A set of language modeling and feature learning techniques in natural language processing (NLP) in which words are mapped to vectors of real numbers.
Words that have the same meaning have similar word embedding representations.
7. Why do we need word embeddings?
The need for unsupervised learning.
They serve many NLP applications, not just one task.
Many machine learning algorithms (including deep nets) require their input to be vectors of continuous values; they just won’t work on strings of plain text (cat, cats, dog, ...).
Vector representation has two important and advantageous properties:
Dimensionality Reduction.
Contextual Similarity.
8. Example of a Simple Method
sentence = ”Word Embeddings are Word converted into numbers”
dictionary = [‘Word’, ’Embeddings’, ’are’, ’Converted’, ’into’, ’numbers’]
The one-hot encoded vector representation of “numbers” according to the above dictionary is:
“numbers” [0, 0, 0, 0, 0, 1]
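As a minimal sketch (plain Python, not part of the original slides), the one-hot vector above could be produced like this:
dictionary = ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers']

def one_hot(word, dictionary):
    # A vector of zeros with a single 1 at the word's index in the dictionary.
    vector = [0] * len(dictionary)
    vector[dictionary.index(word)] = 1
    return vector

print(one_hot('numbers', dictionary))  # [0, 0, 0, 0, 0, 1]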
9. Different Types of Word Embeddings
1. Frequency-based Embedding:
Count Vector
TF-IDF Vector
Co-Occurrence Vector
2. Prediction-based Embedding:
CBOW (Continuous Bag of Words)
Skip-Gram model
11. Different Types of Word Embeddings: TF-IDF Vector
• Takes into account the entire corpus.
• Penalizes common words by assigning them lower weights.
• TF = (number of times term t appears in a document) / (number of terms in the document).
• IDF = log(N/n), where N is the number of documents and n is the number of documents in which term t appears.
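These two formulas can be sketched in plain Python as follows (the toy documents below are hypothetical and only meant to illustrate the computation):
import math

documents = [
    'the sky is blue'.split(),
    'the sun is bright'.split(),
]

def tf(term, document):
    # TF = (number of times term t appears in the document) / (number of terms in the document)
    return document.count(term) / len(document)

def idf(term, documents):
    # IDF = log(N/n): N = number of documents, n = number of documents containing term t
    n = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n)

print(tf('the', documents[0]) * idf('the', documents))  # common word, weight 0.0
print(tf('sky', documents[0]) * idf('sky', documents))  # rarer word, higher weight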
12. Different Types of Word Embeddings: Co-Occurrence Vector
• The idea is that similar words tend to occur together and will have similar contexts.
• Co-occurrence is the number of times two words have appeared together within a context window.
• The context window is specified by a number (its size) and a direction.
13. Co-Occurrence Vector Example
Corpus: مدينة الرياض عاصمة السعودية مدينة الرياض متطورة مدينة الرياض مزدحمة
(Roughly: “The city of Riyadh is the capital of Saudi Arabia, the city of Riyadh is developed, the city of Riyadh is crowded.”)
The 2 (around) context window for the word ‘عاصمة’ covers the two words on either side of it. Let us calculate a co-occurrence matrix whose rows and columns are the six words: مدينة، الرياض، عاصمة، السعودية، متطورة، مزدحمة.
14. Co-Occurrence Vector Example
Sliding the size-2 window over the corpus مدينة الرياض عاصمة السعودية مدينة الرياض متطورة مدينة الرياض مزدحمة, the pair (مدينة، الرياض) falls inside the window four times.
So the co-occurrence of the two words (مدينة، الرياض) is 4.
15. Co-Occurrence Vector Example
The co-occurrence matrix is not the word vector representation that is generally used. Instead, it is decomposed using techniques like PCA, SVD, etc. into factors, and the combination of these factors forms the word vector representation.
Co-occurrence counts with the word مدينة (the first column of the matrix):
مدينة: 0
الرياض: 4
عاصمة: 2
السعودية: 1
متطورة: 2
مزدحمة: 1
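A minimal NumPy sketch of building this co-occurrence matrix and decomposing it (SVD is only one of the techniques mentioned above, and the 2-dimensional cut-off is an arbitrary choice for illustration):
import numpy as np

corpus = 'مدينة الرياض عاصمة السعودية مدينة الرياض متطورة مدينة الرياض مزدحمة'.split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
window = 2  # the "2 (around)" context window from the example

# Count how often each pair of words appears within the window.
M = np.zeros((len(vocab), len(vocab)))
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            M[index[word], index[corpus[j]]] += 1

print(M[index['مدينة'], index['الرياض']])  # 4.0, as in the example

# The matrix itself is rarely used directly; it is decomposed and the
# leading factors are kept as the word vectors.
U, S, Vt = np.linalg.svd(M)
word_vectors = U[:, :2] * S[:2]  # keep 2 dimensions for illustration
print(dict(zip(vocab, word_vectors.round(2))))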
16. Different Types of Word Embeddings: Prediction-based Embedding
• Introduced in word2vec.
• Both CBOW and Skip-Gram are shallow neural networks (NN) with three layers.
• The neural networks map word(s) to a target variable which is also a word (or words).
• Both techniques learn weights which act as the word vector representations.
17. Different Types of Word Embeddings: CBOW
• The CBOW NN predicts the probability of a word given its context.
19. CBOW
Example one-hot inputs (vocabulary size V = 10):
The word ‘Sample’: 0 0 0 1 0 0 0 0 0 0
The word ‘Corpus’: 0 0 0 0 1 0 0 0 0 0
N is the number of neurons in the hidden layer, which equals the number of dimensions we choose to represent our words.
The weights between the hidden layer and the output layer are taken as the word vector representation of the word.
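A rough NumPy sketch of this architecture with a single context word (V = 10 as above, N = 4 chosen arbitrarily; this is an illustration, not the actual word2vec implementation):
import numpy as np

V, N = 10, 4                      # vocabulary size and hidden-layer size
W_in = np.random.rand(V, N)       # input -> hidden weights
W_out = np.random.rand(N, V)      # hidden -> output weights

x = np.zeros(V)
x[3] = 1.0                        # one-hot input, e.g. the word 'Sample'

h = x @ W_in                      # hidden layer: effectively selects row 3 of W_in
scores = h @ W_out                # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax: probability of each target word
print(probs.round(3), probs.sum())             # probabilities sum to 1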
21. Different Types of Word Embeddings: Skip-Gram
• The aim of Skip-Gram is to predict the context given a word.
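For instance, the (center word, context word) training pairs that Skip-Gram learns from can be generated like this (a sketch with a made-up sentence and window size):
sentence = 'word embeddings are words converted into numbers'.split()
window = 2

# Each pair is one training example: predict the context word given the center word.
pairs = [(sentence[i], sentence[j])
         for i in range(len(sentence))
         for j in range(max(0, i - window), min(len(sentence), i + window + 1))
         if i != j]
print(pairs[:4])  # [('word', 'embeddings'), ('word', 'are'), ('embeddings', 'word'), ('embeddings', 'are')]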
23. CBOW and Skip-Gram
In simpler words, CBOW tends to find the probability of a word occurring in a neighbourhood (context), so it generalises over all the different contexts in which a word can be used.
Skip-Gram, on the other hand, tends to learn the different contexts separately, so it needs enough data for each context. Hence Skip-Gram requires more data to train, but (given enough data) it captures more knowledge about the contexts.
24. Word Embedding Tools
The most famous algorithms used to build word embeddings:
Word2vec: CBOW and Skip-Gram. Created by a team of researchers led by Tomas Mikolov at Google (Mikolov et al., 2013).
GloVe: based on co-occurrence statistics. Developed as an open-source project at Stanford University (Pennington et al., 2014).
fastText: BOW + subword information; based on the skip-gram model, where each word is represented as a bag of character n-grams. Created by Facebook's AI Research (FAIR) lab (2018).
26. Word Embedding Tools
• Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. Gensim includes implementations of word2vec, doc2vec, and other algorithms.
• Eclipse Deeplearning4j is a deep learning programming library written for Java. Deeplearning4j includes implementations of word2vec, doc2vec, GloVe, and more.
28. Word2Vec Tutorial
The Gensim library provides an easy way to use Word2Vec in Python.
We will use the Classical Arabic Corpus compiled by Maha Alrabiah.
You will apply the following steps:
Train your own word2vec model on an Arabic corpus.
Save your model.
Print word similarity scores.
Get the word vector of a word.
Pick the odd word out.
Load a pre-trained model.
29. Word2Vec Tutorial
Use the following Python code for each step:
1. Import the needed modules:
import gensim
from gensim.models import word2vec
import logging
2. Set up logging:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
30. Word2Vec Tutorial
3. Load your corpus:
sentences = word2vec.Text8Corpus('…/your_data')
4. Train the model:
model = word2vec.Word2Vec(sentences, size=300, sg=0)
size: Dimensionality of the feature vectors.
sg: Defines the training algorithm. If 1, skip-gram is used; otherwise (0), CBOW is employed.
window: The maximum distance between the current and predicted word within a sentence.
min_count: Ignores all words with total frequency lower than this.
…
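Note: on newer Gensim releases (4.x) the size parameter was renamed to vector_size, so on a recent installation the same call would look like this:
# Gensim 4.x: size was renamed to vector_size; sg, window and min_count keep the same meaning.
model = word2vec.Word2Vec(sentences, vector_size=300, sg=0)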
31. Word2Vec Tutorial
Use the following Python code for each step:
5. Save the trained model:
model.save('…/workshop.model')
model.wv.save_word2vec_format('…/workshop.bin', binary=True)
6. Print word similarity scores:
most_similar = model.wv.most_similar('السماء')
for term, score in most_similar:
    print(term, score)
model.wv.similarity('word1', 'word2')
32. Word2Vec Tutorial
Use the following Python code for each step:
7. Get the word vector of a word:
print(model.wv['محمد'])
8. Pick the odd word out:
print(model.wv.doesnt_match("السماوات السبع الجبال الارض مكه".split()))
9. Load a pre-trained model:
model = gensim.models.KeyedVectors.load_word2vec_format('…/workshop.bin', binary=True)
We use natural language applications, or benefit from them, every day.
Natural Language Processing (NLP) helps machines “read” text by simulating the human ability to understand language.
The challenge with machine translation technologies is not in translating words, but in understanding the meaning of sentences to provide a true translation.
It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems, such as part-of-speech tagging, information retrieval, and question answering.
One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities.
Dimensionality Reduction: it is a more efficient representation.
Contextual Similarity: it is a more expressive representation.
Page 92, Neural Network Methods in Natural Language Processing, 2017.
In the first type, the statistics of how often a word co-occurs with its neighboring words are computed, and these statistics are then mapped down to a vector for each word.
Predictive models, however, take 'raw' text as input and learn a word's representation by predicting its surrounding context (the case of the skip-gram model) or predicting a word given its surrounding context (the case of the Continuous Bag of Words (CBOW) model), using gradient descent with randomly initialized vectors.
There may be quite a few variations in how a count vector is built:
The way the dictionary is prepared (e.g., keeping only the top 10,000 words).
The way the count is taken for each word (raw frequency or just presence).
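As a sketch, both variations map directly onto parameters of scikit-learn's CountVectorizer (scikit-learn is an assumption here; it is not mentioned in the original notes):
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Word Embeddings are Word converted into numbers']

# max_features limits the dictionary (e.g. the top 10,000 words);
# binary=True records presence instead of frequency.
counts   = CountVectorizer(max_features=10000).fit_transform(docs)
presence = CountVectorizer(max_features=10000, binary=True).fit_transform(docs)
print(counts.toarray())    # 'word' occurs twice -> count 2
print(presence.toarray())  # presence only -> every entry is 0 or 1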
This method computes the statistics of word co-occurrences with neighboring words and then maps these statistics down to a vector for each word.
The input layer and the target are both one-hot encoded vectors of size [1 x V].
With multiple context words, the input can be thought of as several one-hot encoded vectors.
The calculation of the hidden activation changes: instead of just copying the corresponding row of the input-hidden weight matrix to the hidden layer, an average is taken over all the corresponding rows of the matrix.
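A small NumPy sketch of that averaging step (the sizes and indices are made up for illustration):
import numpy as np

V, N = 10, 4
W_in = np.random.rand(V, N)      # input -> hidden weight matrix

context_indices = [3, 4]         # e.g. the one-hot positions of two context words
# With several context words, the hidden activation is the average of the
# corresponding rows of the input-hidden matrix rather than a single copied row.
h = W_in[context_indices].mean(axis=0)
print(h.shape)                   # (4,) -> one N-dimensional hidden activation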
It just flips CBOW’s architecture on its head.
A sum is taken over all the error vectors to obtain a final error vector.