word2vec in theory and practice with tensorflow


www.bgoncalves.com
https://github.com/bmtgoncalves/word2vec-and-friends/
Teaching machines to read!

• Computers are really good at crunching numbers, but not so much when it comes to words.
• Perhaps we can represent words numerically?
• Can we do it in a way that preserves semantic information?
• Words that have similar meanings are used in similar contexts, and the context in which a word is used helps us understand its meaning:

The red house is beautiful.
The blue house is old.
The red car is beautiful.
The blue car is old.

"You shall know a word by the company it keeps" (J. R. Firth)
One-hot encoding: each word is represented by a vector with a single 1 at the word's position in the vocabulary:

a 1, about 2, above 3, after 4, again 5, against 6, all 7, am 8, an 9, and 10, any 11, are 12, aren't 13, as 14, …

v_after = (0, 0, 0, 1, 0, 0, ⋯)^T
v_above = (0, 0, 1, 0, 0, 0, ⋯)^T
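
A minimal sketch of this encoding in numpy (the toy vocabulary is illustrative, and uses 0-based indices rather than the 1-based table above):

import numpy as np

# Toy vocabulary: word -> index (illustrative, 0-based)
vocab = {"a": 0, "about": 1, "above": 2, "after": 3, "again": 4}
V = len(vocab)

def one_hot(word):
    """Return the length-V one-hot vector for `word`."""
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

print(one_hot("after"))  # [0. 0. 0. 1. 0.]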
Teaching machines to read!

➡ Words with similar meanings should have similar representations.
➡ From a word we can get some idea about the context where it might appear: max p(C|w)

___ ___ house __ ____.
___ ___ car __ _______.

➡ And from the context we have some idea about the possible words: max p(w|C)

The red _____ is beautiful.
The blue _____ is old.

"You shall know a word by the company it keeps" (J. R. Firth)
word2vec (Mikolov 2013)

Skipgram: max p(C|w) — given a word, predict its context.
Continuous Bag of Words (CBOW): max p(w|C) — given a context, predict the word.

[Diagram: both architectures are built from the same pieces — a one-hot input vector, a word-embeddings matrix Θ₁, a context-embeddings matrix Θ₂, and an activation function. Skipgram maps the word w_j to the context words w_{j−1}, w_{j+1}, …; CBOW maps the context words to w_j.]
Skipgram

• Let us take a better look at a simplified case with a single context word.
• Words are one-hot encoded vectors of length V: w_j = (0, 0, 1, 0, 0, 0, ⋯)^T
• Θ₁ is an (M × V) matrix, so that when we take the product Θ₁ · w_j we are effectively selecting the j-th column of Θ₁: v_j = Θ₁ · w_j
• The linear activation function simply passes this value along, which is then multiplied by Θ₂, a (V × M) matrix.
• Each element k of the output layer is then given by u_k^T · v_j, where u_k is the k-th row of Θ₂.
• We convert these values to a normalized probability distribution by using the softmax.
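
A minimal numpy sketch of this forward pass (dimensions and initialization are illustrative):

import numpy as np

V, M = 10, 4                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(M, V))   # word embeddings, one column per word
Theta2 = rng.normal(size=(V, M))   # context embeddings, one row u_k per word

j = 2
w_j = np.zeros(V)
w_j[j] = 1.0                       # one-hot input vector

v_j = Theta1 @ w_j                 # selects the j-th column of Theta1
scores = Theta2 @ v_j              # scores[k] = u_k . v_j, one per vocabulary word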
Softmax

• A standard way of converting a set of numbers to a normalized probability distribution:

softmax(x)_j = exp(x_j) / Σ_l exp(x_l)

• With this final ingredient we obtain:

p(w_k | w_j) ≡ softmax(u_k^T · v_j) = exp(u_k^T · v_j) / Σ_l exp(u_l^T · v_j)

• Our goal is then to learn Θ₁ and Θ₂ so that we can predict what the next word is likely to be using p(w_{j+1} | w_j).
• But how can we quantify how far we are from the correct answer? Our error measure shouldn't be just binary (right or wrong)…
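
A numerically stable softmax in numpy (a sketch; subtracting the maximum before exponentiating is a standard trick, not something the slides mention):

import numpy as np

def softmax(x):
    """Map a vector of scores to a normalized probability distribution."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # three probabilities that sum to 1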
Cross-Entropy

• First we have to recall that we are, in effect, comparing two probability distributions: p(w_k | w_j) and the one-hot encoding of the context: w_{j+1} = (0, 0, 0, 1, 0, 0, ⋯)^T
• The Cross Entropy measures the distance, in number of bits, between two probability distributions p and q:

H(p, q) = −Σ_k p_k log q_k

• In our case, this becomes:

H[w_{j+1}, p(w_k | w_j)] = −Σ_k w^k_{j+1} log p(w_k | w_j)

• So it's clear that the only non-zero term is the one that corresponds to the "hot" element of w_{j+1}:

H = −log p(w_{j+1} | w_j)

• This is our Error function. But how can we use this to update the values of Θ₁ and Θ₂?
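
A sketch of this collapse in numpy: with a one-hot target, the full cross-entropy sum reduces to minus the log-probability of the single correct word.

import numpy as np

def cross_entropy(target, p):
    """H(target, p) = -sum_k target_k * log(p_k)."""
    return -np.sum(target * np.log(p))

p = np.array([0.1, 0.2, 0.6, 0.1])   # predicted distribution over 4 words
target = np.array([0., 0., 1., 0.])  # one-hot context word

# Both lines compute the same number: only the "hot" term survives.
print(cross_entropy(target, p))
print(-np.log(p[2]))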
Gradient Descent

• Find the gradient ∂H/∂θ_mn for each training batch.
• Take a step downhill along the direction of the gradient:

θ_mn ← θ_mn − α ∂H/∂θ_mn

• where α is the step size.
• Repeat until "convergence".
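
The update rule itself is a one-liner; a sketch in numpy with illustrative shapes:

import numpy as np

alpha = 0.01                 # step size
theta = np.zeros((4, 10))    # any parameter matrix (Theta1 or Theta2)
grad = np.ones_like(theta)   # dH/dtheta for the current batch (from backprop)

theta -= alpha * grad        # step downhill along the gradient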
Chain-rule

• How can we calculate

∂H/∂θ_mn = −∂/∂θ_mn log p(w_{j+1} | w_j) ?

• We rewrite:

∂H/∂θ_mn = −∂/∂θ_mn log [ exp(u_k^T · v_j) / Σ_l exp(u_l^T · v_j) ]

• and expand:

u_k^T · v_j = Σ_q θ^(2)_kq θ^(1)_qj,    with θ_mn ∈ { θ^(1)_mn, θ^(2)_mn }

• Then we can rewrite:

∂H/∂θ_mn = −∂/∂θ_mn [ u_k^T · v_j − log Σ_l exp(u_l^T · v_j) ]

• and apply the chain rule:

∂f(g(x))/∂x = ∂f(g(x))/∂g(x) · ∂g(x)/∂x
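
A handy way to verify a hand-derived gradient like this is a finite-difference check. A sketch for the loss above, using the known gradient of −log softmax (which is p − one_hot(k); that derivation is standard but not spelled out in the slides):

import numpy as np

def loss(theta, k=2):
    """-log softmax(theta)[k]: the per-example cross-entropy."""
    e = np.exp(theta - np.max(theta))
    return -np.log(e[k] / e.sum())

theta = np.random.default_rng(1).normal(size=5)

# Analytic gradient: softmax(theta) - one_hot(k)
e = np.exp(theta - np.max(theta))
analytic = e / e.sum()
analytic[2] -= 1.0

# Finite-difference check, one coordinate at a time
eps = 1e-6
for m in range(5):
    d = np.zeros(5)
    d[m] = eps
    numeric = (loss(theta + d) - loss(theta - d)) / (2 * eps)
    assert abs(numeric - analytic[m]) < 1e-4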
SkipGram with Larger Contexts

• Use the same Θ₂ for all context words.
• Use the average of the cross entropies — the single-context error H = −log p(w_{j+1} | w_j) becomes:

H = −(1/T) Σ_t log p(w_{j+t} | w_j)

• Word order is not important (the average does not change).
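
A sketch of this averaged loss (reusing the forward pass and softmax from above; all shapes are illustrative):

import numpy as np

def skipgram_loss(Theta1, Theta2, center, context_ids):
    """Average cross-entropy of the context words given the center word."""
    v = Theta1[:, center]                    # embedding of the center word
    scores = Theta2 @ v                      # one score per vocabulary word
    e = np.exp(scores - np.max(scores))
    p = e / e.sum()                          # p(w_k | w_center)
    return -np.mean(np.log(p[context_ids]))  # average over the window

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(4, 10))
Theta2 = rng.normal(size=(10, 4))
print(skipgram_loss(Theta1, Theta2, center=3, context_ids=[1, 2, 4, 5]))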
Continuous Bag of Words

• The process is essentially the same, run in reverse: the context words w_{j−1}, w_{j+1}, … are the input and are used to predict the central word w_j.
Variations

• Hierarchical Softmax:
• Approximate the softmax using a binary tree.
• Reduces the number of calculations per training example from O(V) to O(log₂ V), increasing performance by orders of magnitude.
• Negative Sampling:
• Instead of normalizing over the whole vocabulary, update the weights only for the observed context word and a handful of randomly sampled "negative" words (a sketch of this objective follows below).
• Subsampling of frequent words (often applied alongside either variant):
• Undersample the most frequent words by removing them from the text before generating the contexts.
• Similar idea to removing stop-words — very frequent words are less informative.
• Effectively makes the window larger, increasing the amount of information available for context.
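
A minimal sketch of the negative-sampling objective (the standard formulation from Mikolov et al. 2013, not code from the talk; sigmoid is the logistic function):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, u_context, u_negatives):
    """-log sig(u_ctx . v) - sum_neg log sig(-u_neg . v): pull the true
    context word up and the sampled negatives down, with no full softmax."""
    pos = np.log(sigmoid(u_context @ v_center))
    neg = np.sum(np.log(sigmoid(-(u_negatives @ v_center))))
    return -(pos + neg)

rng = np.random.default_rng(0)
v = rng.normal(size=4)           # center-word embedding (a column of Theta1)
u_ctx = rng.normal(size=4)       # true context embedding (a row of Theta2)
u_neg = rng.normal(size=(5, 4))  # 5 randomly sampled negative words
print(negative_sampling_loss(v, u_ctx, u_neg))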
Comments

• word2vec, even in its original formulation, is actually a family of algorithms using various combinations of:
• Skip-gram, CBOW
• Hierarchical Softmax, Negative Sampling
• The output of this neural network is deterministic:
• If two words appear in the same context ("blue" vs "red", e.g.), they will have similar internal representations in Θ₁ and Θ₂.
• Θ₁ and Θ₂ are vector embeddings of the input words and the context words, respectively.
• Words that are too rare are also removed.
• The original implementation had a dynamic window size (sketched below):
• for each word in the corpus a window size k′ is sampled uniformly between 1 and k.
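
A sketch of dynamic-window context generation (the tokenized corpus and window size here are illustrative):

import numpy as np

def context_pairs(tokens, k, rng):
    """Yield (center, context) pairs with the window size resampled per word."""
    for i, center in enumerate(tokens):
        k_prime = rng.integers(1, k + 1)  # uniform in 1..k for this word
        lo, hi = max(0, i - k_prime), min(len(tokens), i + k_prime + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

rng = np.random.default_rng(0)
for pair in context_pairs("the red house is beautiful".split(), k=2, rng=rng):
    print(pair)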
Online resources

• C - https://code.google.com/archive/p/word2vec/ (the original one)
• Python/tensorflow - https://www.tensorflow.org/tutorials/word2vec
• Both a minimalist and an efficient version are available in the tutorial.
• Python/gensim - https://radimrehurek.com/gensim/models/word2vec.html
• Pretrained embeddings:
• 90 languages, trained using Wikipedia: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
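
As a quick sketch of the gensim route (the corpus is a stand-in; note that recent gensim versions call the dimensionality argument vector_size, while older ones used size):

from gensim.models import Word2Vec

# Stand-in corpus: one tokenized sentence per list
sentences = [
    ["the", "red", "house", "is", "beautiful"],
    ["the", "blue", "house", "is", "old"],
    ["the", "red", "car", "is", "beautiful"],
    ["the", "blue", "car", "is", "old"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # embedding dimension M ("size" in older gensim)
    window=2,        # maximum window size k
    min_count=1,     # keep even rare words in this toy corpus
    sg=1,            # 1 = skip-gram, 0 = CBOW
    negative=5,      # number of negative samples
)
print(model.wv["house"])  # the learned embedding vector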
Analogies

• The embedding of each word is a function of the context it appears in:

v(red) = f(context(red))

• Words that appear in similar contexts will have similar embeddings:

context(red) ≈ context(blue) ⟹ v(red) ≈ v(blue)

• The "distributional hypothesis" in linguistics.

"You shall know a word by the company it keeps" (J. R. Firth)

Geometrical relations between contexts imply semantic relations between words!

[Diagram: capital embeddings (Paris, Rome, Washington DC, Lisbon) cluster in a "capital context" region and country embeddings (France, Italy, Portugal, USA) in a "country context" region, with roughly parallel capital→country offsets.]

v(France) − v(Paris) + v(Rome) = v(Italy)
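
Continuing the gensim sketch above, this analogy query looks like the following (assuming a model whose vocabulary actually contains these words — the toy corpus above does not):

# France - Paris + Rome  ~  Italy
result = model.wv.most_similar(positive=["France", "Rome"],
                               negative=["Paris"], topn=1)
print(result)  # ideally [("Italy", <similarity>)]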
Visualization
https://github.com/bmtgoncalves/word2vec-and-friends/
Tensorflow
A diversion… (https://www.tensorflow.org/)

• Let's imagine I want to perform these calculations:

y = f(x)
z = g(y)

• for some given x.
• To calculate z we must follow a certain sequence of operations,
• which can be shortened if we are interested in just the value of y.
• In Tensorflow, this is called a Computational Graph and it's the most fundamental concept to understand.
• Data, in the form of tensors, flows through the graph from inputs to outputs.
• Tensorflow is, essentially, a way of defining arbitrary computational graphs in a way that can be automatically distributed and optimized.
Computational Graphs

• If we use base functions, tensorflow knows how to automatically calculate the respective gradients:
• Automatic BackProp
• Graphs can have multiple outputs:
• Predictions
• Cost functions
• etc…
Sessions

• After we have defined the computational graph, we can start using it to make calculations.
• All computations must take place within a "session" that defines the values of all required inputs.
• Which values are required for a specific computation depends on what part of the graph is actually being executed.
• When you request the value of a specific output, tensorflow determines the specific subgraph that must be executed and the required input values.
• For optimization purposes, it can also execute independent parts of the graph on different devices (CPUs, GPUs, TPUs, etc.) at the same time.
A basic Tensorflow program (basic.py)

z = c * (x + y)

import tensorflow as tf

# Placeholders: inputs whose values are supplied at run time
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

# Constant node
c = tf.constant(3.)

# Operations: these build the graph; nothing is computed yet
m = tf.add(x, y)
z = tf.multiply(m, c)

# All computation happens inside a session
with tf.Session() as sess:
    output = sess.run(z, feed_dict={x: 1., y: 2.})
    print("Output value is:", output)

[Diagram: the two placeholders feed the add node, whose output and the constant feed the multiply node.]
Feeding the inputs x = 1 and y = 2 into the placeholders, the add node evaluates to 3 and the multiply node — the requested output — to 9.
Linear Regression (linear.py)

y = W * x + b
cost = (1/N) Σ_i (y_i − Y_i)²

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

learning_rate = 0.01
N = 100
N_steps = 300

# Training Data
train_X = np.linspace(-10, 10, N)
train_Y = 2*train_X + 3 + 5*np.random.random(N)

# Computational Graph
X = tf.placeholder("float")
Y = tf.placeholder("float")

W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")

y = tf.add(tf.multiply(X, W), b)         # the model: y = W*x + b
cost = tf.reduce_mean(tf.pow(y - Y, 2))  # mean squared error

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for step in range(N_steps):
        sess.run(optimizer, feed_dict={X: train_X, Y: train_Y})
        cost_val, W_val, b_val = sess.run([cost, W, b],
                                          feed_dict={X: train_X, Y: train_Y})
        print("step", step, "cost", cost_val, "w", W_val, "b", b_val)
Jupyter Notebook