Word Embedding - Word2Vec and Relatives
Wael Farhan - Mawdoo3
University of California, San Diego
JOSA - Jordan Open Source Association
13/2/18
Goal of the talk
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
Text, text, and even more text
• A lot of our businesses rely on serving new content to our users
• Blogs, Q&A, social media, dictionaries, books …
• Text will continue to be generated at an even larger scale.
Problem?
We are not getting much value out of this large corpus of text.
Goals of Natural Language Processing
• Similar Words - Enhanced Search Engine
  • Find synonyms for words
• Sentiment Analysis
  • Determine whether a sentence is positive or negative
• Machine Translation
  • Translate text from one human language to another
• Question and Answer
  • Given a question, find the correct answer
• Categorization
  • Automatically classify a question, a book, a description, etc.
Nature of NLP Problems
(Diagram: Input Sentence → NLP Model → Output.)
Example input: “The new song is gorgeous”
The output can be a translation: الأغنية الجديدة رائعة (the Arabic translation of the input)
or a prediction: Positive Sentiment
Word Representation
~170,000 words
In order to process natural language, we need to represent words in a way computers (neural networks) can understand.
One-hot representation: each word is a vector with a single 1 at the word’s index and 0 everywhere else, e.g. “a” → index 0, “abbreviations” → index 1, …, “zoology” → index 99,998, “zoom” → index 99,999.
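As an illustration only (not from the slides), a minimal NumPy sketch of a one-hot lookup over a tiny stand-in vocabulary:

import numpy as np

vocab = ["a", "abbreviations", "zoology", "zoom"]   # tiny stand-in vocabulary
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))      # one slot per vocabulary word
    vec[index[word]] = 1.0          # single 1 at the word's index
    return vec

print(one_hot("zoom"))              # [0. 0. 0. 1.]

With a real vocabulary, each vector has one entry per word (~170,000), only one of which is non-zero, which is exactly the space and sparsity problem the next slides address.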
Goal
The goal is to find a better word representation that captures semantic meaning and word relations,
consumes less space, and makes NLP models train faster and perform better.
Word Properties
• Words have similar properties:
  The cat is eating dinner
  The dog is eating dinner
• Morphological variations:
  He works hard
  He worked hard
• Word relationships:
  Berlin is the capital of Germany
  Rome is the capital of Italy
• Synonyms and antonyms
• … etc.
Hypothetical Word Space
(Figure: a hypothetical 3-dimensional word space with axes Gender, Tense, and Singular/Plural. Conjugations of the Arabic verb ذهب “to go” — ذهب, ذهبت, ذهبا, ذهبوا, ذهبن, يذهب, تذهب, يذهبون, يذهبن — are placed at coordinates such as (0,0,1), (0,1,1), (1,0,-1), (0.5,0,-1) according to these three properties.)
Benefits
With the magic of Word2Vec we are able to learn dimensions that capture the semantic meaning and properties of words.
Example - 1
(Examples shown live; code at https://github.com/wael34218/word_embeddings)
Example - 2
• In his research paper, Mikolov mentions the following example:
vec("King") - vec("Man") + vec("Woman") ≈ vec("Queen")
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
(Figure: the Man → Woman offset is roughly parallel to the King → Queen offset.)
Example - 3
(Figure: a 2-dimensional projection of country and capital vectors; the country → capital offset is roughly the same for China–Beijing, Russia–Moscow, Japan–Tokyo, Turkey–Ankara, Germany–Berlin, France–Paris, Italy–Rome, and Spain–Madrid.)
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
Cosine Similarity
similarity(A, B) = (A · B) / (|A| |B|)
Example: similarity(Apple, Banana) ≈ 1, while similarity(Apple, Car) ≈ 0.
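A minimal NumPy sketch of the formula above; the three vectors are made-up toy values, not real embeddings:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

apple  = np.array([0.9, 0.8, 0.1])   # toy 3-dim vectors
banana = np.array([0.8, 0.9, 0.0])
car    = np.array([0.0, 0.0, 1.0])

print(cosine_similarity(apple, banana))   # ≈ 0.99, close to 1
print(cosine_similarity(apple, car))      # ≈ 0.08, close to 0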
How Word2Vec Helps
• Similar Words - Enhanced Search Engine
  • Finding the nearest words to your query
• Sentiment Analysis
  • A few dimensions can indicate whether the sentiment is good or bad
• Machine Translation & Question and Answer
  • Similar words will be treated the same
• Categorization
  • Words from the same field (politics, sports, etc.) will be clustered in the same area
Milestones
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
How to use Word2Vec - Python
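The slide showed a screenshot; the sketch below is the roughly equivalent gensim call. The corpus and file name are hypothetical, and the embedding-size argument is called size in gensim < 4.0 and vector_size in gensim ≥ 4.0:

from gensim.models import Word2Vec

sentences = [                      # hypothetical tokenized corpus
    ["القط", "يأكل", "العشاء"],
    ["الكلب", "يأكل", "العشاء"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding size N (use size=100 on gensim < 4.0)
    window=3,          # context window
    min_count=1,       # minimum word count
    sg=1,              # 1 = SkipGram, 0 = CBoW
    negative=5,        # negative samples
)
model.save("word2vec.model")       # hypothetical output path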
Vector Lookup - Python
array([ 2.3184156e+00, -3.7887740e+00, -1.5767757e+00, 2.0226068e+00,
-6.2421966e-01, -1.9068855e+00, 4.3609171e+00, 5.2191928e-02,
-2.1188414e+00, -2.3100405e+00, 5.8612001e-01, 1.3959754e+00,
-1.3236507e+00, 5.3390409e-05, -3.2201216e-01, 2.6455419e+00,
4.5713658e+00, 4.3059942e-01, -9.1906101e-01, -1.8534625e-02,
-2.7166015e-01, -2.2993209e+00, 7.8024969e-02, -3.2237511e+00,
3.3045592e+00, -1.0334913e+00, 1.4119573e+00, 3.7495871e+00,
2.8075716e+00, 1.0440959e-01, -3.9444833e+00, -2.2009590e-01,
-2.9403548e+00, -1.4462236e+00, 2.4799142e+00, 7.7769762e-01,
5.4318172e-01, -2.6818683e+00, -3.0701482e+00, 4.3109632e+00,
-7.7415538e-01, 1.9786474e+00, 1.1503514e+00, 2.6723063e+00,
-1.5133847e+00, 1.4275682e-01, 3.5057294e-01, 6.3898432e-01,
9.9464995e-01, 1.7852293e+00, 9.5475733e-01, 2.9222999e+00,
-3.5561893e+00, 3.1446383e+00, -4.4377699e+00, -4.3674165e-01,
-1.9084896e-01, 2.8170996e+00, -3.0291042e+00, -1.1227336e+00,
-3.6801448e+00, -1.2687838e+00, 1.7091125e-01, -8.1778312e-01,
2.1771207e+00, -2.6653576e+00, -1.5208750e+00, -1.8047930e-01,
8.8296290e-03, -2.7885602e+00, -1.5657809e+00, -2.3738770e+00,
8.7824135e+00, -9.4801110e-01, 2.1755044e+00, -2.1538272e+00,
-5.9697658e-01, -8.5682195e-01, 2.5586643e+00, -4.3383533e-01,
-1.3269461e+00, 3.8761835e+00, -8.1207365e-02, -1.6046954e+00,
-4.4856617e-01, -3.2454314e+00, 2.5956264e+00, -3.6466745e-01,
7.7527708e-01, -7.4778008e+00, 2.3812482e+00, -4.6497111e+00,
-2.4220943e+00, 1.5012804e-01, -1.5416908e+00, -3.4357128e+00,
3.7048867e+00, -4.2515426e+00, 5.9101069e-01, -1.0800831e+00],
dtype=float32)
The array above is the 100-dimensional vector returned by model['بيان'] (بيان = “statement”).
Vector Most Similar - Python
model.most_similar(['بيان'])   # بيان = "statement"
Vector Most Similar - Python
model.most_similar(['متر'])   # متر = "meter"
Vector Most Similar - Python
model.most_similar(['كندا'])   # كندا = "Canada"
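most_similar returns the nearest words by cosine similarity as (word, score) pairs; the actual output for these queries was shown as screenshots. A sketch of the call and output shape (in gensim ≥ 4.0 the lookup goes through model.wv, i.e. model.wv['بيان'] and model.wv.most_similar(...)):

similar = model.wv.most_similar(["كندا"], topn=10)
# A list of (word, cosine_similarity) tuples, e.g.
# [(word_1, 0.81), (word_2, 0.78), ...]   # illustrative scores only
for word, score in similar:
    print(word, round(score, 3))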
Milestones
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
Under the Hood
(Photo: Tomas Mikolov.)
Under the Hood
• Word2Vec is a simple neural network with one hidden layer.
• It learns to predict a missing word from its surrounding context (C), using a window size of 3:
يوجد على الأرض عدد كبير من المواد الكيميائية الصلبة ذات الصيغ المعقدة
(“There is a large number of solid chemical substances with complex formulas on Earth.”)
• It has 2 different flavors:
  • Continuous Bag of Words (CBoW)
  • SkipGram
CBOW Architecture
Given a sentence (X), predict the missing word (y).
• P(y | X)
• e.g. P(y = “is” | X = [“The”, “cat”, “eating”, “dinner”])
• V: vocabulary size (~170k)
• N: vector embedding size
• W: weights
(Diagram: the C one-hot context vectors x1 … xC, each V-dim, are multiplied by a shared V × N weight matrix W and combined into the N-dim hidden layer h; h is then multiplied by W′ (N × V) to produce the V-dim output y.)
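The training data is simply (context, target) pairs sliced out of running text with a sliding window. A minimal sketch (the function name and window value are illustrative, not from the slides):

def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for CBoW-style training."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

sentence = "the cat is eating dinner".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# ['cat', 'is'] -> the
# ['the', 'is', 'eating'] -> cat
# ...

For SkipGram the same window is used the other way around: each (target, context_word) pair becomes one training example.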
SkipGram Architecture
Given a word (x), predict the rest of the sentence (Y).
• P(Y | x)
• e.g. P(Y = [“The”, “is”, “eating”, “dinner”] | x = “cat”)
• V: vocabulary size (~170k)
• N: vector embedding size
• W: weights
(Diagram: a single one-hot input x (V-dim) is projected by W (V × N) into the N-dim hidden layer h, which is multiplied by W′ (N × V) to produce the C output distributions y1 … yC, one per context position.)
Forward Feed
import tensorflow as tf   # TensorFlow 1.x style, as on the original slide

# vocab_size and embedding_size are assumed to be defined (e.g. 170000 and 100)
x = tf.placeholder(tf.float32, [None, vocab_size])               # one-hot input word(s)
W1 = tf.Variable(tf.random_normal([vocab_size, embedding_size]))
b1 = tf.Variable(tf.zeros([embedding_size]))
Y1 = tf.add(tf.matmul(x, W1), b1)                                # hidden layer h (N-dim)
W2 = tf.Variable(tf.random_normal([embedding_size, vocab_size]))
b2 = tf.Variable(tf.zeros([vocab_size]))
pred = tf.nn.softmax(tf.add(tf.matmul(Y1, W2), b2))              # probabilities over the V words
x = OneHot(“dog”)
h = x · W1      (W1 is a V × N matrix)
y = h · W2      (W2 is an N × V matrix)
P = softmax(y)
Output Layer - Softmax
y_j = h · W2[:, j]   (the score for vocabulary word j)
p_j = exp(y_j) / Σ_{j'} exp(y_{j'}),   where j' loops over the vocabulary (size V).
Backpropagation
A one-hot V-dim input feeds an N-dim hidden layer and a V-dim output layer. Example output vs. target:
Output: 0.1  0.2  0.6  0.1
Target: 0    0    1    0
Loss function: cross entropy, E = −Σ_j target_j · log(output_j); here E = −log(0.6) ≈ 0.51.
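A quick NumPy check of the loss value above (plain arithmetic, not part of the slides):

import numpy as np

output = np.array([0.1, 0.2, 0.6, 0.1])   # predicted probabilities
target = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot true word

loss = -np.sum(target * np.log(output))   # cross entropy
print(loss)                               # ≈ 0.51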
Gradient Descent
(Figure: the loss J(w) plotted against a weight w; training walks downhill on this curve.)
Each weight is nudged in the direction that reduces the loss: w ← w − η · ∂J/∂w, where η is the learning rate.
Stages of Learning
1- Prepare Dataset
2- Define your network (weights and topology)
3- Forward Propagation
4- Loss Function
5- Back-Propagation
6- Update Weights
7- Repeat steps 3 to 6 (a minimal numeric sketch of steps 3–6 follows)
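A minimal NumPy sketch of steps 3–6 for a single (input, target) pair; the sizes and values are toy numbers, and real Word2Vec training loops over millions of pairs with the tricks described next:

import numpy as np

V, N, lr = 8, 3, 0.1                     # toy vocab size, embedding size, learning rate
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))  # 2- define the network weights
W2 = rng.normal(scale=0.1, size=(N, V))

x = np.zeros(V); x[2] = 1.0              # 1- one training pair: one-hot input word ...
t = np.zeros(V); t[5] = 1.0              #    ... and one-hot target word

for step in range(200):                  # 7- repeat steps 3 to 6
    h = x @ W1                           # 3- forward propagation
    scores = h @ W2
    p = np.exp(scores - scores.max()); p /= p.sum()   # softmax
    loss = -np.log(p[5])                 # 4- cross-entropy loss
    d_scores = p - t                     # 5- back-propagation
    dW2 = np.outer(h, d_scores)
    dW1 = np.outer(x, d_scores @ W2.T)
    W2 -= lr * dW2                       # 6- update weights
    W1 -= lr * dW1

print(loss)                              # much smaller than at step 0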
Negative Sampling
• Calculating all V probabilities to compute the softmax is slow.
• Instead of updating the entire V output words, we update only the positive (correct) word plus a small number of randomly selected “negative” words.
• The probability of a word being selected as a negative sample is proportional to its frequency: the more frequent the word, the more likely it is to be selected.
(Figure: the V-dim output layer y with example probabilities 0.21, 0.31, 0.01, 0.11, 0.34, 0.21, 0.08, 0.12; only the sampled entries are updated.)
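A minimal sketch of drawing negative samples from a frequency-based distribution (toy counts; the original word2vec implementation additionally raises the counts to the 3/4 power before normalizing):

import numpy as np

counts = {"the": 200, "quick": 56, "brown": 13, "fox": 2,
          "jumps": 17, "over": 77, "lazy": 90, "dog": 90}

words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)
probs = freq ** 0.75                     # 3/4 power, as in the original implementation
probs /= probs.sum()

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=probs)   # e.g. 5 negative words for one step
print(negatives)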
Minimum Count of Words
• Remove words with very few occurrences in the text.
• There are not enough repetitions to learn an accurate projection for them.
• Example: suppose the corpus-wide counts are
  the=200, quick=56, brown=13, fox=2, jumps=17, over=77, lazy=90, dog=90.
If the minimum count is set to 5, then “fox” (count 2) is dropped and the sentence becomes:
“the quick brown jumps over the lazy dog”
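A tiny sketch of the filter, reusing the counts from the example above:

min_count = 5
counts = {"the": 200, "quick": 56, "brown": 13, "fox": 2,
          "jumps": 17, "over": 77, "lazy": 90, "dog": 90}

sentence = "the quick brown fox jumps over the lazy dog".split()
kept = [w for w in sentence if counts[w] >= min_count]
print(" ".join(kept))    # "the quick brown jumps over the lazy dog"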
Subsampling Frequent Words
• Frequent words like “the” show up in many training examples.
• That gives us far more training examples for them than we need to learn their vectors.
• Each word is therefore deleted from the corpus with a probability based on its frequency.
“the quick brown fox jumps over the lazy dog”
The two “the” tokens have a high deletion probability, so the sentence could end up as:
“the quick brown fox jumps over lazy dog”
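A minimal sketch of the subsampling step. The discard probability below, 1 - sqrt(t / f(w)), is the formula from Mikolov et al. (2013); t is the subsampling threshold from the hyperparameter list later in the deck, and the relative frequencies are made up for illustration:

import numpy as np

t = 1e-5                 # subsampling threshold (the deck suggests 0 - 1e-5)
freqs = {"the": 0.05, "quick": 5e-6, "brown": 5e-6, "fox": 1e-6,
         "jumps": 5e-6, "over": 2e-5, "lazy": 5e-6, "dog": 5e-6}

def discard_prob(word):
    # Mikolov et al. (2013): P(discard w) = 1 - sqrt(t / f(w))
    return max(0.0, 1.0 - np.sqrt(t / freqs[word]))

rng = np.random.default_rng(0)
sentence = "the quick brown fox jumps over the lazy dog".split()
kept = [w for w in sentence if rng.random() >= discard_prob(w)]
print(" ".join(kept))    # very frequent words ("the") are usually dropped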
End Result
• Usually the trained model itself is the output of a deep learning project.
• In Word2Vec, the learned weights are the output: each row of the input weight matrix W is a word’s vector.
(Diagrams: the SkipGram and CBoW architectures shown earlier, side by side.)
Evaluation
• Relatedness (synonym words)
• Analogy (resemblance of relationships)
• Categorization (clusters of words, e.g. positive adjectives)
• Selectional preference (“people eat”, not “eat people”)
Problem: these all require building a query inventory.
The easiest way is to measure accuracy on the task at hand.
Schnabel, Tobias, et al. "Evaluation methods for unsupervised word embeddings." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015.
Tuning Hyperparameters
• Embedding size
• Window size
• Model (CBoW or SkipGram)
• Negative samples: between 5 and 20
• Minimum count
• Subsampling rate: between 0 and 1e-5
• Learning rate
• Batch size
• Epochs
Limitations
• Cannot handle out-of-vocabulary (OOV) words.
• Hard to evaluate by itself.
• Can find quite different words similar because they share contexts:
ذهب الطالب النشيط إلى المدرسة (“The energetic student went to school”)
ذهب الطالب المشاغب إلى المدرسة (“The troublemaking student went to school”)
The same happens for months, days, colors, etc.
Milestones
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
FastText
• The input is not only the word but also its constituent character n-grams.
For example, the word “swimming” with 3-grams:
^sw | swi | wim | imm | mmi | min | ing | ng$ | ^swimming$
• The network now also needs to learn vector embeddings for these subwords.
• The final vector of a word equals the sum of its subword vectors plus the vector of the word itself.
• Advantages:
  • Handles out-of-vocabulary (OOV) words by approximating the vector representation of a missing word from its subwords (a small sketch follows).
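A minimal sketch of the subword decomposition using the slide’s ^/$ boundary markers (the real FastText implementation uses angle-bracket markers and hashes the n-grams into buckets); the random vectors stand in for learned embeddings:

import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams with ^/$ boundaries, plus the full word itself."""
    marked = "^" + word + "$"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

print(char_ngrams("swimming"))
# ['^sw', 'swi', 'wim', 'imm', 'mmi', 'min', 'ing', 'ng$', '^swimming$']

# The word vector is the sum of its subword vectors (random stand-ins here),
# which is how an out-of-vocabulary word can still get an approximate vector.
rng = np.random.default_rng(0)
subword_vectors = {g: rng.normal(size=100) for g in char_ngrams("swimming")}
word_vector = np.sum(list(subword_vectors.values()), axis=0)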
My First Use For Word2Vec
• Starting from a database: lab tests, prescriptions, diagnoses, symptoms, conditions.
• Extract all medical events for a particular patient.
• Sort the events in chronological order.
• Add a prefix (‘l_’, ‘p_’, ‘d_’, ‘s_’, ‘c_’) to each event ID according to its type (a small sketch of this preparation follows the example).
Patient 512: S_253 l_50970 l_50971 p_DW100 … d_427 d_038
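A minimal sketch of that preparation; the field layout and event values are hypothetical:

prefix = {"lab": "l_", "prescription": "p_", "diagnosis": "d_",
          "symptom": "s_", "condition": "c_"}

# Hypothetical (timestamp, event_type, event_id) records for one patient.
events = [(3, "lab", "50970"), (1, "symptom", "253"), (7, "diagnosis", "427")]

events.sort(key=lambda e: e[0])                        # chronological order
sentence = [prefix[kind] + eid for _, kind, eid in events]
print(" ".join(sentence))                              # "s_253 l_50970 d_427"

Each patient’s event sequence then plays the role of a “sentence” fed to Word2Vec.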
(In the figure, the event sequence runs left to right in time; example events are annotated as Fever, High HDL, and Cardiac Rhythm Abnormality.)
Farhan, Wael, et al. "A predictive model for medical events based on contextual embedding of temporal sequences." JMIR Medical Informatics 4.4 (2016).
Medical Embeddings
V[Heart Failure] − V[Furosemide] ≈ V[Acute Renal Failure] − V[Sodium Chloride]
Farhan, Wael, et al. "A predictive model for medical events based on contextual embedding of temporal sequences." JMIR medical informatics 4.4 (2016).
Doc2Vec
(Figures: the Paragraph Vector - Distributed Memory (PV-DM) and Paragraph Vector - Distributed Bag of Words (PV-DBOW) architectures.)
Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International Conference on Machine Learning. 2014.
t-distributed stochastic neighbor embedding
• t-SNE for short.
• How did we get from 300 dimensions down to 2?
• t-SNE is a dimensionality reduction algorithm.
• It is commonly used in ML to make high-dimensional data easier to visualize (a small sketch follows).
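A minimal sketch with scikit-learn’s TSNE; the embedding matrix here is random stand-in data rather than real word vectors:

import numpy as np
from sklearn.manifold import TSNE

vectors = np.random.default_rng(0).normal(size=(1000, 300))   # 1,000 "words" x 300 dims

coords = TSNE(n_components=2, perplexity=30, init="random",
              random_state=0).fit_transform(vectors)
print(coords.shape)    # (1000, 2) - ready to scatter-plot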
Summary
• Word2Vec is an unsupervised learning tool that generates vector representations capturing the semantic meaning of words.
• It has 2 different architectures: CBoW and SkipGram.
• Word2Vec boosted accuracy on almost all NLP problems.
• The concept behind Word2Vec generalizes to other domains such as medicine, the stock market, DNA analysis, etc.
Milestones
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
Share Your Story
How was your experience with Word2Vec?
We are Hiring!!!
Send your CV:
wael.farhan@mawdoo3.com
We are looking for an experienced DevOps engineer and a Data Scientist.
Thank You
Any questions?!

Editor's Notes
• #5 Exact search matches.
• #8 The one-hot representation is not feasible for two main reasons: it requires a lot of space and is very sparse, which causes problems, and it cannot capture similarities between words.
• #10 We need a way to extract word semantics based on the context of the sentences they appear in.
• #19 “أستاذ” (ustadh) and “معلم” (mu‘allim), both meaning “teacher”, are treated the same.
• #21-25 If you are building a dictionary app, this can be used to find similar words. If you are building a blog, you can use it to find semantically similar articles.
• #29 C is the context; w is the word.