Word Embedding - Word2Vec and Relatives
Wael Farhan - Mawdoo3
University of California, San Diego
JOSA - Jordan Open Source Association
13/2/18
Goal of the talk
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
Text, text, and even more text
• A lot of our businesses rely on serving new content to our users
• Blogs, Q&A, social media, dictionaries, books …
• Text will continue to be generated at an even larger scale.
Problem?
We are not getting much value out of this large corpus of text.
Goals of Natural Language Processing
• Similar Words - Enhanced Search Engine
  • Find synonyms for words
• Sentiment Analysis
  • Determine whether a sentence is positive or negative
• Machine Translation
  • Translate text from one human language to another
• Question and Answer
  • Given a question, find the correct answer
• Categorization
  • Automatically classify a question, a book, a description, etc.
Nature of NLP Problems
(Diagram: Input Sentence → NLP Model → Output.)
Example input: “The new song is gorgeous”
The output can be a translation: الأغنية الجديدة رائعة (the Arabic translation of the input)
or a prediction: Positive Sentiment
Word Representation
~170,000 words
In order to process natural language, we need to represent words in a way computers (neural networks) can understand.
One-hot representation: each word is a vector with a single 1 at the word’s index and 0 everywhere else, e.g. “a” → index 0, “abbreviations” → index 1, …, “zoology” → index 99,998, “zoom” → index 99,999.
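As an illustration only (not from the slides), a minimal NumPy sketch of a one-hot lookup over a tiny stand-in vocabulary:

import numpy as np

vocab = ["a", "abbreviations", "zoology", "zoom"]   # tiny stand-in vocabulary
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))      # one slot per vocabulary word
    vec[index[word]] = 1.0          # single 1 at the word's index
    return vec

print(one_hot("zoom"))              # [0. 0. 0. 1.]

With a real vocabulary, each vector has one entry per word (~170,000), only one of which is non-zero, which is exactly the space and sparsity problem the next slides address.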
Goal
The goal is to find a better word representation that captures semantic meaning and word relations,
consumes less space, and makes NLP models train faster and perform better.
Word Properties
• Words have similar properties:
  The cat is eating dinner
  The dog is eating dinner
• Morphological variations:
  He works hard
  He worked hard
• Word relationships:
  Berlin is the capital of Germany
  Rome is the capital of Italy
• Synonyms and antonyms
• … etc.
Hypothetical Word Space
(Figure: a hypothetical 3-dimensional word space with axes Gender, Tense, and Singular/Plural. Conjugations of the Arabic verb ذهب “to go” — ذهب, ذهبت, ذهبا, ذهبوا, ذهبن, يذهب, تذهب, يذهبون, يذهبن — are placed at coordinates such as (0,0,1), (0,1,1), (1,0,-1), (0.5,0,-1) according to these three properties.)
Benefits
With the magic of Word2Vec we are able to learn dimensions that capture the semantic meaning and properties of words.
Example - 1
(Examples shown live; code at https://github.com/wael34218/word_embeddings)
Example - 2
• In his research paper, Mikolov mentions the following example:
vec("King") - vec("Man") + vec("Woman") ≈ vec("Queen")
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
(Figure: the Man → Woman offset is roughly parallel to the King → Queen offset.)
Example - 3
(Figure: a 2-dimensional projection of country and capital vectors; the country → capital offset is roughly the same for China–Beijing, Russia–Moscow, Japan–Tokyo, Turkey–Ankara, Germany–Berlin, France–Paris, Italy–Rome, and Spain–Madrid.)
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
Cosine Similarity
similarity(A, B) = (A · B) / (|A| |B|)
Example: similarity(Apple, Banana) ≈ 1, while similarity(Apple, Car) ≈ 0.
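A minimal NumPy sketch of the formula above; the three vectors are made-up toy values, not real embeddings:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

apple  = np.array([0.9, 0.8, 0.1])   # toy 3-dim vectors
banana = np.array([0.8, 0.9, 0.0])
car    = np.array([0.0, 0.0, 1.0])

print(cosine_similarity(apple, banana))   # ≈ 0.99, close to 1
print(cosine_similarity(apple, car))      # ≈ 0.08, close to 0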
How Word2Vec Helps
• Similar Words - Enhanced Search Engine
  • Finding the nearest words to your query
• Sentiment Analysis
  • A few dimensions can indicate whether the sentiment is good or bad
• Machine Translation & Question and Answer
  • Similar words will be treated the same
• Categorization
  • Words from the same field (politics, sports, etc.) will be clustered in the same area
Milestones
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
How to use Word2Vec - Python
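The slide showed a screenshot; the sketch below is the roughly equivalent gensim call. The corpus and file name are hypothetical, and the embedding-size argument is called size in gensim < 4.0 and vector_size in gensim ≥ 4.0:

from gensim.models import Word2Vec

sentences = [                      # hypothetical tokenized corpus
    ["القط", "يأكل", "العشاء"],
    ["الكلب", "يأكل", "العشاء"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding size N (use size=100 on gensim < 4.0)
    window=3,          # context window
    min_count=1,       # minimum word count
    sg=1,              # 1 = SkipGram, 0 = CBoW
    negative=5,        # negative samples
)
model.save("word2vec.model")       # hypothetical output path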
Vector Lookup - Python
array([ 2.3184156e+00, -3.7887740e+00, -1.5767757e+00, 2.0226068e+00,
-6.2421966e-01, -1.9068855e+00, 4.3609171e+00, 5.2191928e-02,
-2.1188414e+00, -2.3100405e+00, 5.8612001e-01, 1.3959754e+00,
-1.3236507e+00, 5.3390409e-05, -3.2201216e-01, 2.6455419e+00,
4.5713658e+00, 4.3059942e-01, -9.1906101e-01, -1.8534625e-02,
-2.7166015e-01, -2.2993209e+00, 7.8024969e-02, -3.2237511e+00,
3.3045592e+00, -1.0334913e+00, 1.4119573e+00, 3.7495871e+00,
2.8075716e+00, 1.0440959e-01, -3.9444833e+00, -2.2009590e-01,
-2.9403548e+00, -1.4462236e+00, 2.4799142e+00, 7.7769762e-01,
5.4318172e-01, -2.6818683e+00, -3.0701482e+00, 4.3109632e+00,
-7.7415538e-01, 1.9786474e+00, 1.1503514e+00, 2.6723063e+00,
-1.5133847e+00, 1.4275682e-01, 3.5057294e-01, 6.3898432e-01,
9.9464995e-01, 1.7852293e+00, 9.5475733e-01, 2.9222999e+00,
-3.5561893e+00, 3.1446383e+00, -4.4377699e+00, -4.3674165e-01,
-1.9084896e-01, 2.8170996e+00, -3.0291042e+00, -1.1227336e+00,
-3.6801448e+00, -1.2687838e+00, 1.7091125e-01, -8.1778312e-01,
2.1771207e+00, -2.6653576e+00, -1.5208750e+00, -1.8047930e-01,
8.8296290e-03, -2.7885602e+00, -1.5657809e+00, -2.3738770e+00,
8.7824135e+00, -9.4801110e-01, 2.1755044e+00, -2.1538272e+00,
-5.9697658e-01, -8.5682195e-01, 2.5586643e+00, -4.3383533e-01,
-1.3269461e+00, 3.8761835e+00, -8.1207365e-02, -1.6046954e+00,
-4.4856617e-01, -3.2454314e+00, 2.5956264e+00, -3.6466745e-01,
7.7527708e-01, -7.4778008e+00, 2.3812482e+00, -4.6497111e+00,
-2.4220943e+00, 1.5012804e-01, -1.5416908e+00, -3.4357128e+00,
3.7048867e+00, -4.2515426e+00, 5.9101069e-01, -1.0800831e+00],
dtype=float32)
The array above is the 100-dimensional vector returned by model['بيان'] (بيان = “statement”).
Vector Most Similar - Python
model.most_similar(['بيان'])   # بيان = "statement"
Vector Most Similar - Python
model.most_similar(['متر'])   # متر = "meter"
Vector Most Similar - Python
model.most_similar(['كندا'])   # كندا = "Canada"
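most_similar returns the nearest words by cosine similarity as (word, score) pairs; the actual output for these queries was shown as screenshots. A sketch of the call and output shape (in gensim ≥ 4.0 the lookup goes through model.wv, i.e. model.wv['بيان'] and model.wv.most_similar(...)):

similar = model.wv.most_similar(["كندا"], topn=10)
# A list of (word, cosine_similarity) tuples, e.g.
# [(word_1, 0.81), (word_2, 0.78), ...]   # illustrative scores only
for word, score in similar:
    print(word, round(score, 3))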
Milestones
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
Under the Hood
(Photo: Tomas Mikolov.)
Under the Hood
• Word2Vec is a simple neural network with one hidden layer.
• It learns to predict a missing word from its surrounding context (C), using a window size of 3:
يوجد على الأرض عدد كبير من المواد الكيميائية الصلبة ذات الصيغ المعقدة
(“There is a large number of solid chemical substances with complex formulas on Earth.”)
• It has 2 different flavors:
  • Continuous Bag of Words (CBoW)
  • SkipGram
CBOW Architecture
Given a sentence (X), predict the missing word (y).
• P(y | X)
• e.g. P(y = “is” | X = [“The”, “cat”, “eating”, “dinner”])
• V: vocabulary size (~170k)
• N: vector embedding size
• W: weights
(Diagram: the C one-hot context vectors x1 … xC, each V-dim, are multiplied by a shared V × N weight matrix W and combined into the N-dim hidden layer h; h is then multiplied by W′ (N × V) to produce the V-dim output y.)
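The training data is simply (context, target) pairs sliced out of running text with a sliding window. A minimal sketch (the function name and window value are illustrative, not from the slides):

def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for CBoW-style training."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

sentence = "the cat is eating dinner".split()
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
# ['cat', 'is'] -> the
# ['the', 'is', 'eating'] -> cat
# ...

For SkipGram the same window is used the other way around: each (target, context_word) pair becomes one training example.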
SkipGram Architecture
Given a word (x), predict the rest of the sentence (Y).
• P(Y | x)
• e.g. P(Y = [“The”, “is”, “eating”, “dinner”] | x = “cat”)
• V: vocabulary size (~170k)
• N: vector embedding size
• W: weights
(Diagram: a single one-hot input x (V-dim) is projected by W (V × N) into the N-dim hidden layer h, which is multiplied by W′ (N × V) to produce the C output distributions y1 … yC, one per context position.)
Forward Feed
import tensorflow as tf   # TensorFlow 1.x style, as on the original slide

# vocab_size and embedding_size are assumed to be defined (e.g. 170000 and 100)
x = tf.placeholder(tf.float32, [None, vocab_size])               # one-hot input word(s)
W1 = tf.Variable(tf.random_normal([vocab_size, embedding_size]))
b1 = tf.Variable(tf.zeros([embedding_size]))
Y1 = tf.add(tf.matmul(x, W1), b1)                                # hidden layer h (N-dim)
W2 = tf.Variable(tf.random_normal([embedding_size, vocab_size]))
b2 = tf.Variable(tf.zeros([vocab_size]))
pred = tf.nn.softmax(tf.add(tf.matmul(Y1, W2), b2))              # probabilities over the V words
x = OneHot(“dog”)
h = x · W1      (W1 is a V × N matrix)
y = h · W2      (W2 is an N × V matrix)
P = softmax(y)
Output Layer - Softmax
y_j = h · W2[:, j]   (the score for vocabulary word j)
p_j = exp(y_j) / Σ_{j'} exp(y_{j'}),   where j' loops over the vocabulary (size V).
Backpropagation
A one-hot V-dim input feeds an N-dim hidden layer and a V-dim output layer. Example output vs. target:
Output: 0.1  0.2  0.6  0.1
Target: 0    0    1    0
Loss function: cross entropy, E = −Σ_j target_j · log(output_j); here E = −log(0.6) ≈ 0.51.
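A quick NumPy check of the loss value above (plain arithmetic, not part of the slides):

import numpy as np

output = np.array([0.1, 0.2, 0.6, 0.1])   # predicted probabilities
target = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot true word

loss = -np.sum(target * np.log(output))   # cross entropy
print(loss)                               # ≈ 0.51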
Gradient Descent
(Figure: the loss J(w) plotted against a weight w; training walks downhill on this curve.)
Each weight is nudged in the direction that reduces the loss: w ← w − η · ∂J/∂w, where η is the learning rate.
Stages of Learning
1- Prepare Dataset
2- Define your network (weights and topology)
3- Forward Propagation
4- Loss Function
5- Back-Propagation
6- Update Weights
7- Repeat steps 3 to 6 (a minimal numeric sketch of steps 3–6 follows)
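A minimal NumPy sketch of steps 3–6 for a single (input, target) pair; the sizes and values are toy numbers, and real Word2Vec training loops over millions of pairs with the tricks described next:

import numpy as np

V, N, lr = 8, 3, 0.1                     # toy vocab size, embedding size, learning rate
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, N))  # 2- define the network weights
W2 = rng.normal(scale=0.1, size=(N, V))

x = np.zeros(V); x[2] = 1.0              # 1- one training pair: one-hot input word ...
t = np.zeros(V); t[5] = 1.0              #    ... and one-hot target word

for step in range(200):                  # 7- repeat steps 3 to 6
    h = x @ W1                           # 3- forward propagation
    scores = h @ W2
    p = np.exp(scores - scores.max()); p /= p.sum()   # softmax
    loss = -np.log(p[5])                 # 4- cross-entropy loss
    d_scores = p - t                     # 5- back-propagation
    dW2 = np.outer(h, d_scores)
    dW1 = np.outer(x, d_scores @ W2.T)
    W2 -= lr * dW2                       # 6- update weights
    W1 -= lr * dW1

print(loss)                              # much smaller than at step 0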
Negative Sampling
• Calculating all V probabilities to compute the softmax is slow.
• Instead of updating the entire V output words, we update only the positive (correct) word plus a small number of randomly selected “negative” words.
• The probability of a word being selected as a negative sample is proportional to its frequency: the more frequent the word, the more likely it is to be selected.
(Figure: the V-dim output layer y with example probabilities 0.21, 0.31, 0.01, 0.11, 0.34, 0.21, 0.08, 0.12; only the sampled entries are updated.)
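A minimal sketch of drawing negative samples from a frequency-based distribution (toy counts; the original word2vec implementation additionally raises the counts to the 3/4 power before normalizing):

import numpy as np

counts = {"the": 200, "quick": 56, "brown": 13, "fox": 2,
          "jumps": 17, "over": 77, "lazy": 90, "dog": 90}

words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)
probs = freq ** 0.75                     # 3/4 power, as in the original implementation
probs /= probs.sum()

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=probs)   # e.g. 5 negative words for one step
print(negatives)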
Minimum Count of Words
• Remove words with very few occurrences in the text.
• There are not enough repetitions to learn an accurate projection for them.
• Example: suppose the corpus-wide counts are
  the=200, quick=56, brown=13, fox=2, jumps=17, over=77, lazy=90, dog=90.
If the minimum count is set to 5, then “fox” (count 2) is dropped and the sentence becomes:
“the quick brown jumps over the lazy dog”
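A tiny sketch of the filter, reusing the counts from the example above:

min_count = 5
counts = {"the": 200, "quick": 56, "brown": 13, "fox": 2,
          "jumps": 17, "over": 77, "lazy": 90, "dog": 90}

sentence = "the quick brown fox jumps over the lazy dog".split()
kept = [w for w in sentence if counts[w] >= min_count]
print(" ".join(kept))    # "the quick brown jumps over the lazy dog"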
Subsampling Frequent Words
• Frequent words like “the” show up in many training examples.
• That gives us far more training examples for them than we need to learn their vectors.
• Each word is therefore deleted from the corpus with a probability based on its frequency.
“the quick brown fox jumps over the lazy dog”
The two “the” tokens have a high deletion probability, so the sentence could end up as:
“the quick brown fox jumps over lazy dog”
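A minimal sketch of the subsampling step. The discard probability below, 1 - sqrt(t / f(w)), is the formula from Mikolov et al. (2013); t is the subsampling threshold from the hyperparameter list later in the deck, and the relative frequencies are made up for illustration:

import numpy as np

t = 1e-5                 # subsampling threshold (the deck suggests 0 - 1e-5)
freqs = {"the": 0.05, "quick": 5e-6, "brown": 5e-6, "fox": 1e-6,
         "jumps": 5e-6, "over": 2e-5, "lazy": 5e-6, "dog": 5e-6}

def discard_prob(word):
    # Mikolov et al. (2013): P(discard w) = 1 - sqrt(t / f(w))
    return max(0.0, 1.0 - np.sqrt(t / freqs[word]))

rng = np.random.default_rng(0)
sentence = "the quick brown fox jumps over the lazy dog".split()
kept = [w for w in sentence if rng.random() >= discard_prob(w)]
print(" ".join(kept))    # very frequent words ("the") are usually dropped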
End Result
• Usually the trained model itself is the output of a deep learning project.
• In Word2Vec, the learned weights are the output: each row of the input weight matrix W is a word’s vector.
(Diagrams: the SkipGram and CBoW architectures shown earlier, side by side.)
Evaluation
• Relatedness (synonym words)
• Analogy (resemblance of relationships)
• Categorization (clusters of words, e.g. positive adjectives)
• Selectional preference (“people eat”, not “eat people”)
Problem: these all require building a query inventory.
The easiest way is to measure accuracy on the task at hand.
Schnabel, Tobias, et al. "Evaluation methods for unsupervised word embeddings." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015.
Tuning Hyperparameters
• Embedding size
• Window size
• Model (CBoW or SkipGram)
• Negative samples: between 5 and 20
• Minimum count
• Subsampling rate: between 0 and 1e-5
• Learning rate
• Batch size
• Epochs
Limitations
• Cannot handle out-of-vocabulary (OOV) words.
• Hard to evaluate by itself.
• Can find quite different words similar because they share contexts:
ذهب الطالب النشيط إلى المدرسة (“The energetic student went to school”)
ذهب الطالب المشاغب إلى المدرسة (“The troublemaking student went to school”)
The same happens for months, days, colors, etc.
Milestones
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
FastText
• The input is not only the word but also its constituent character n-grams.
For example, the word “swimming” with 3-grams:
^sw | swi | wim | imm | mmi | min | ing | ng$ | ^swimming$
• The network now also needs to learn vector embeddings for these subwords.
• The final vector of a word equals the sum of its subword vectors plus the vector of the word itself.
• Advantages:
  • Handles out-of-vocabulary (OOV) words by approximating the vector representation of a missing word from its subwords (a small sketch follows).
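A minimal sketch of the subword decomposition using the slide’s ^/$ boundary markers (the real FastText implementation uses angle-bracket markers and hashes the n-grams into buckets); the random vectors stand in for learned embeddings:

import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams with ^/$ boundaries, plus the full word itself."""
    marked = "^" + word + "$"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams + [marked]

print(char_ngrams("swimming"))
# ['^sw', 'swi', 'wim', 'imm', 'mmi', 'min', 'ing', 'ng$', '^swimming$']

# The word vector is the sum of its subword vectors (random stand-ins here),
# which is how an out-of-vocabulary word can still get an approximate vector.
rng = np.random.default_rng(0)
subword_vectors = {g: rng.normal(size=100) for g in char_ngrams("swimming")}
word_vector = np.sum(list(subword_vectors.values()), axis=0)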
My First Use For Word2Vec
• Starting from a database: lab tests, prescriptions, diagnoses, symptoms, conditions.
• Extract all medical events for a particular patient.
• Sort the events in chronological order.
• Add a prefix (‘l_’, ‘p_’, ‘d_’, ‘s_’, ‘c_’) to each event ID according to its type (a small sketch of this preparation follows the example).
Patient 512: S_253 l_50970 l_50971 p_DW100 … d_427 d_038
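A minimal sketch of that preparation; the field layout and event values are hypothetical:

prefix = {"lab": "l_", "prescription": "p_", "diagnosis": "d_",
          "symptom": "s_", "condition": "c_"}

# Hypothetical (timestamp, event_type, event_id) records for one patient.
events = [(3, "lab", "50970"), (1, "symptom", "253"), (7, "diagnosis", "427")]

events.sort(key=lambda e: e[0])                        # chronological order
sentence = [prefix[kind] + eid for _, kind, eid in events]
print(" ".join(sentence))                              # "s_253 l_50970 d_427"

Each patient’s event sequence then plays the role of a “sentence” fed to Word2Vec.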
(In the figure, the event sequence runs left to right in time; example events are annotated as Fever, High HDL, and Cardiac Rhythm Abnormality.)
Farhan, Wael, et al. "A predictive model for medical events based on contextual embedding of temporal sequences." JMIR Medical Informatics 4.4 (2016).
Medical Embeddings
V[Heart Failure] − V[Furosemide] ≈ V[Acute Renal Failure] − V[Sodium Chloride]
Farhan, Wael, et al. "A predictive model for medical events based on contextual embedding of temporal sequences." JMIR medical informatics 4.4 (2016).
Doc2Vec
(Figures: the Paragraph Vector - Distributed Memory (PV-DM) and Paragraph Vector - Distributed Bag of Words (PV-DBOW) architectures.)
Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International Conference on Machine Learning. 2014.
t-distributed stochastic neighbor embedding
• t-SNE for short.
• How did we get from 300 dimensions down to 2?
• t-SNE is a dimensionality reduction algorithm.
• It is commonly used in ML to make high-dimensional data easier to visualize (a small sketch follows).
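A minimal sketch with scikit-learn’s TSNE; the embedding matrix here is random stand-in data rather than real word vectors:

import numpy as np
from sklearn.manifold import TSNE

vectors = np.random.default_rng(0).normal(size=(1000, 300))   # 1,000 "words" x 300 dims

coords = TSNE(n_components=2, perplexity=30, init="random",
              random_state=0).fit_transform(vectors)
print(coords.shape)    # (1000, 2) - ready to scatter-plot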
Summary
• Word2Vec is an unsupervised learning tool that generates vector representations capturing the semantic meaning of words.
• It has 2 different architectures: CBoW and SkipGram.
• Word2Vec boosted accuracy on almost all NLP problems.
• The concept behind Word2Vec generalizes to other domains such as medicine, the stock market, DNA analysis, etc.
Milestones
If you don’t know Word2Vec: Learn what Word2Vec does and why it is useful.
If you know Word2Vec: Learn how to use it.
If you have already used Word2Vec: Learn how it works under the hood.
If you know how Word2Vec works: Check out different variations (FastText) and different usage scenarios.
If you have already used it: Share your experience!
Share Your Story
How was your experience with Word2Vec?
We are Hiring!!!
Send your CV:
wael.farhan@mawdoo3.com
We are looking for an experienced DevOps engineer and a Data Scientist.
Thank You
Any questions?!

Editor's Notes
• #5 Exact search matches.
• #8 The one-hot representation is not feasible for two main reasons: it requires a lot of space and is very sparse, which causes problems, and it cannot capture similarities between words.
• #10 We need a way to extract word semantics based on the context of the sentences they appear in.
• #19 “أستاذ” (ustadh) and “معلم” (mu‘allim), both meaning “teacher”, are treated the same.
• #21-25 If you are building a dictionary app, this can be used to find similar words. If you are building a blog, you can use it to find semantically similar articles.
• #29 C is the context; w is the word.