word2vec and friends


www.bgoncalves.com (@bgoncalves)
• Computers are really good at crunching numbers, but not so much when it comes to words.
• Perhaps we can represent words numerically?



• Can we do it in a way that preserves semantic information?





• Words that have similar meanings are used in similar contexts, and the context in which a word is used helps us understand its meaning.
Teaching machines to read!
The red house is beautiful.

The blue house is old.

The red car is beautiful.

The blue car is old.
“You shall know a word by the company it keeps”

(J. R. Firth)
a 1
about 2
above 3
after 4
again 5
against 6
all 7
am 8
an 9
and 10
any 11
are 12
aren't 13
as 14
… …
v_after = (0, 0, 0, 1, 0, 0, ...)^T
v_above = (0, 0, 1, 0, 0, 0, ...)^T
One-hot encoding
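A minimal sketch of the one-hot scheme above, assuming a small toy vocabulary (the word list and helper names are illustrative):

```python
import numpy as np

vocab = ["a", "about", "above", "after", "again", "against"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length V with a single 1 at the word's index."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot("after"))  # [0. 0. 0. 1. 0. 0.]
print(one_hot("above"))  # [0. 0. 1. 0. 0. 0.]
```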
Teaching machines to read!
➡Words with similar meanings should have similar representations.
➡From a word we can get some idea about the context where it might appear 



➡And from the context we have some idea about possible words
“You shall know a word by the company it keeps”

(J. R. Firth)
max p (C|w)
max p (w|C)
The red _____ is beautiful.

The blue _____ is old.
___ ___ house __ ____.

___ ___ car __ _______.

word2vec
• Skipgram: max p(C|w), predict the Context from the Word.
• Continuous Bag of Words (CBOW): max p(w|C), predict the Word from the Context.
(Architecture diagrams: a one-hot input vector for the central word w_j, word embeddings Θ1, context embeddings Θ2, and an activation function connecting w_j to its context words w_{j-1}, w_{j+1}.)
Mikolov 2013
Skipgram
• Let us take a closer look at a simplified case with a single context word.
• Words are one-hot encoded vectors of length V, e.g. w_j = (0, 0, 1, 0, 0, 0, ...)^T
• Θ1 is an (M × V) matrix, so that when we take the product Θ1 · w_j we are effectively selecting the j-th column of Θ1: v_j = Θ1 · w_j
• The linear activation function simply passes this value along, which is then multiplied by Θ2, a (V × M) matrix.
• Each element k of the output layer is then given by u_k^T · v_j, where u_k is the k-th row of Θ2.
• We convert these values to a normalized probability distribution by using the softmax.
(Architecture: w_j → Θ1 → Θ2 → softmax → w_{j+1})
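A minimal numpy sketch of this forward pass, assuming a toy vocabulary size and randomly initialized Θ1 and Θ2 (all names and sizes are illustrative):

```python
import numpy as np

V, M = 10, 4                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(M, V))   # word embeddings, one column per word
theta2 = rng.normal(size=(V, M))   # context embeddings, one row u_k per word

j = 2
w_j = np.zeros(V)
w_j[j] = 1.0                       # one-hot input word

v_j = theta1 @ w_j                 # selects the j-th column of theta1
scores = theta2 @ v_j              # scores[k] = u_k^T . v_j, one value per vocabulary word
```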
Softmax
• A standard way of converting a set of numbers to a normalized probability distribution:
  softmax(x)_j = exp(x_j) / Σ_l exp(x_l)
• With this final ingredient we obtain:
  p(w_k|w_j) ≡ softmax(u_k^T · v_j) = exp(u_k^T · v_j) / Σ_l exp(u_l^T · v_j)
• Our goal is then to learn Θ1 and Θ2 so that we can predict what the next word is likely to be using p(w_{j+1}|w_j).
• But how can we quantify how far we are from the correct answer? Our error measure shouldn't be just binary (right or wrong)…
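A minimal softmax sketch matching the formula above; subtracting the maximum is a standard numerical-stability trick, not something from the slide:

```python
import numpy as np

def softmax(x):
    """Convert a vector of scores into a normalized probability distribution."""
    e = np.exp(x - np.max(x))   # shift by the max for numerical stability
    return e / e.sum()

# probs[k] = p(w_k | w_j), given scores[k] = u_k^T . v_j from the previous sketch
# probs = softmax(scores)
```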
Cross-Entropy
• First we have to recall that we are, in effect, comparing two probability distributions: the model output p(w_k|w_j) and the one-hot encoding of the context, w_{j+1} = (0, 0, 0, 1, 0, 0, ...)^T
• The Cross Entropy measures the distance, in number of bits, between two probability distributions p and q:
  H(p, q) = −Σ_k p_k log q_k
• In our case, this becomes:
  H[w_{j+1}, p(w_k|w_j)] = −Σ_k w^k_{j+1} log p(w_k|w_j)
• So it's clear that the only non-zero term is the one that corresponds to the "hot" element of w_{j+1}:
  H = −log p(w_{j+1}|w_j)
• This is our error function. But how can we use this to update the values of Θ1 and Θ2?
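A small sketch of the cross entropy with a one-hot target, showing how the sum collapses to a single term (names are illustrative):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q))

# With p = one-hot(w_{j+1}) only one term survives, so
# cross_entropy(one_hot_target, probs) == -np.log(probs[target_index]).
```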
Gradient Descent
• Find the gradient of the error H for each training batch.
• Take a step downhill along the direction of the gradient:
  θ_mn ← θ_mn − α ∂H/∂θ_mn
• where α is the step size.
• Repeat until "convergence".
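A one-line sketch of the update rule above, assuming the gradient of H with respect to a parameter matrix is already available:

```python
def sgd_step(theta, grad, alpha=0.025):
    """Take one step downhill along the gradient; alpha is the step size."""
    return theta - alpha * grad

# theta1 = sgd_step(theta1, grad_theta1)
# theta2 = sgd_step(theta2, grad_theta2)
```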
Chain-rule
• How can we calculate the gradient with respect to each parameter θ_mn ∈ {θ^(1)_mn, θ^(2)_mn}?
  ∂H/∂θ_mn = −∂/∂θ_mn log p(w_{j+1}|w_j)
• We rewrite it using the softmax expression:
  ∂H/∂θ_mn = −∂/∂θ_mn log [ exp(u_k^T · v_j) / Σ_l exp(u_l^T · v_j) ]
• and expand:
  u_k^T · v_j = Σ_q θ^(2)_kq θ^(1)_qj
• Then we can rewrite:
  ∂H/∂θ_mn = −∂/∂θ_mn [ u_k^T · v_j − log Σ_l exp(u_l^T · v_j) ]
• and apply the chain rule:
  ∂f(g(x))/∂x = [∂f(g(x))/∂g(x)] · [∂g(x)/∂x]
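Carrying the chain rule through gives the usual softmax/cross-entropy gradients; the closed forms below are a standard derivation rather than code from the slides, and they reuse the `softmax` sketch defined above:

```python
import numpy as np

def skipgram_gradients(theta1, theta2, j, k):
    """Gradients of H = -log p(w_k | w_j) with respect to both parameter matrices."""
    v_j = theta1[:, j]                    # embedding of the input word
    probs = softmax(theta2 @ v_j)         # p(w_l | w_j) for every l
    delta = probs.copy()
    delta[k] -= 1.0                       # (probabilities - one-hot target)
    grad_theta2 = np.outer(delta, v_j)    # shape (V, M)
    grad_theta1 = np.zeros_like(theta1)
    grad_theta1[:, j] = theta2.T @ delta  # only the j-th column receives a gradient
    return grad_theta1, grad_theta2
```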
SkipGram with Larger Contexts
• Use the same Θ2 for all context words.
• Use the average of the cross entropy over the T words in the window (sketched in code below):
  H = −log p(w_{j+1}|w_j)  becomes  H = −(1/T) Σ_t log p(w_{j+t}|w_j)
• Word order is not important (the average does not change).
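A small sketch of the averaged loss over a window, assuming `probs` holds p(w_k | w_j) for the central word j:

```python
import numpy as np

def window_loss(probs, context_indices):
    """Average cross entropy over all context positions (order does not matter)."""
    return -np.mean([np.log(probs[c]) for c in context_indices])
```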
Continuous Bag of Words
• The process is essentially the same, but run in the opposite direction: the context words w_{j-1}, w_{j+1} are used to predict the central word w_j, using the same Θ1 and Θ2 matrices.
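A sketch of the CBOW direction in its usual formulation (average the embeddings of the context words, then score the central word); the deck only shows the diagram, so the exact matrix ordering here is an assumption:

```python
import numpy as np

def cbow_scores(theta1, theta2, context_indices):
    """Predict the central word w_j from the average of its context embeddings."""
    h = np.mean([theta1[:, c] for c in context_indices], axis=0)
    return theta2 @ h   # feed into softmax to obtain p(w_j | context)
```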
Variations
• Hierarchical Softmax:
  • Approximate the softmax using a binary tree.
  • Reduces the number of calculations per training example from V to log2(V), increasing performance by orders of magnitude.
• Negative Sampling:
  • Instead of normalizing over the full vocabulary, each training example updates only the observed context word and a small sample of "negative" words.
• Subsampling of frequent words:
  • Undersample the most frequent words by removing them from the text before generating the contexts.
  • Similar idea to removing stop-words: very frequent words are less informative.
  • Effectively makes the window larger, increasing the amount of information available for context.
Comments
• word2vec, even in its original formulation, is actually a family of algorithms using various combinations of:
  • Skip-gram, CBOW
  • Hierarchical Softmax, Negative Sampling
• The output of this neural network is deterministic:
  • If two words appear in the same context ("blue" vs "red", e.g.), they will have similar internal representations in Θ1 and Θ2.
• Θ1 and Θ2 are vector embeddings of the input words and the context words, respectively.
• Words that are too rare are also removed.
• The original implementation had a dynamic window size:
  • for each word in the corpus a window size k' is sampled uniformly between 1 and k (see the sketch below).
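A one-liner for the dynamic window size mentioned above (the function name is illustrative):

```python
import numpy as np

def sample_window_size(k, rng=np.random.default_rng()):
    """For each word in the corpus, sample an effective window size k' uniformly in 1..k."""
    return int(rng.integers(1, k + 1))
```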
Online resources
• C - https://code.google.com/archive/p/word2vec/ (the original one)
• Python/tensorflow - https://www.tensorflow.org/tutorials/word2vec
  • Both a minimalist and an efficient version are available in the tutorial.
• Python/gensim - https://radimrehurek.com/gensim/models/word2vec.html (usage sketch below)
• Pretrained embeddings:
  • 90 languages, trained using Wikipedia: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
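A minimal gensim usage sketch, assuming a list of tokenized sentences; the parameter names follow gensim 4.x (older releases use `size` instead of `vector_size`):

```python
from gensim.models import Word2Vec

sentences = [["the", "red", "house", "is", "beautiful"],
             ["the", "blue", "house", "is", "old"]]

model = Word2Vec(sentences,
                 vector_size=100,  # embedding dimension M
                 window=5,         # maximum context window k
                 sg=1,             # 1 = skip-gram, 0 = CBOW
                 negative=5,       # negative sampling (set hs=1 for hierarchical softmax)
                 min_count=1)      # keep rare words in this toy corpus

print(model.wv["house"])           # the learned embedding vector
```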
Analogies
• The embedding of each word is a function of the context it appears in:
  v(red) = f(context(red))
• Words that appear in similar contexts will have similar embeddings:
  context(red) ≈ context(blue) ⟹ v(red) ≈ v(blue)
• "Distributional hypothesis" in linguistics.
"You shall know a word by the company it keeps" (J. R. Firth)
• Geometrical relations between contexts imply semantic relations between words!
  v(France) − v(Paris) + v(Rome) = v(Italy)
(Figure: capitals such as Paris, Rome, Washington DC and Lisbon cluster in a "capital context", while France, Italy, Portugal and USA cluster in a "country context".)
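With a trained model (for instance the gensim sketch above, or pretrained vectors), the analogy arithmetic is a one-liner; a toy corpus will not actually return "Italy", this only shows the call:

```python
result = model.wv.most_similar(positive=["France", "Rome"], negative=["Paris"], topn=1)
print(result)   # a well-trained model is expected to surface "Italy"
```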
Visualization
Linguistic Change
• Train word embeddings for different years using
Google Books
• Independently trained embeddings differ by an
arbitrary rotation
• Align the different embeddings for different years
• Track the way in which the meaning of words shifted
over time!
Kulkarni, Al-Rfou, Perozzi & Skiena, "Statistically Significant Detection of Linguistic Change", WWW'15, 625 (2015).
(Embedded figure from the paper: a 2-dimensional projection of the latent semantic space showing the semantic trajectory of the word "gay" from 1900 to 2005, moving from neighbors such as "cheerful" and "dapper" towards "homosexual", "lesbian" and "transgender".)
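A minimal sketch of aligning two independently trained embedding spaces with orthogonal Procrustes (an SVD-based alignment commonly used for this purpose; this is not code from the paper):

```python
import numpy as np

def align(source, target):
    """Rotate `source` onto `target`; rows are the embeddings of the same words in two years."""
    u, _, vt = np.linalg.svd(source.T @ target)
    omega = u @ vt            # best orthogonal map from source to target
    return source @ omega
```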
node2vec
• You can generate a graph out of a sequence of words by assigning a node to each word and connecting each word to its neighbors through edges.
• With this representation, a piece of text is a walk through the network. Then perhaps we can invert the process: use walks through a network to generate sequences of nodes that can be used to train node embeddings.
• Node embeddings should capture features of the network structure and allow for the detection of similarities between nodes.
Grover & Leskovec, "node2vec: Scalable Feature Learning for Networks", KDD'16, 855 (2016). arXiv:1607.00653.
node2vec
• The features depend strongly on the way in which the network is traversed.
• Generate the contexts for each node using Breadth First Search and Depth First Search.
• Perform a biased Random Walk.
KDD'16, 855 (2016)
(Embedded figures from the paper: Figure 1, BFS and DFS search strategies from node u (k = 3); Figure 2, illustration of the random walk procedure in node2vec, with transition weights α = 1/p back to the previous node t, α = 1 to common neighbors of t and v, and α = 1/q to nodes further away.)
• BFS - Explores only limited neighborhoods. Suitable for structural equivalences.
• DFS - Freely explores neighborhoods and covers homophilous communities.
• By modifying the parameters of the model it can interpolate between the BFS and DFS extremes (see the sketch below).
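A minimal sketch of one step of the biased walk, assuming the graph is a dict mapping each node to the set of its neighbors; p and q are the return and in-out parameters (the function name is illustrative):

```python
import random

def next_node(graph, prev, curr, p=1.0, q=1.0):
    """Choose the next node after moving from `prev` to `curr` (node2vec transition weights)."""
    neighbors = list(graph[curr])
    weights = []
    for x in neighbors:
        if x == prev:
            weights.append(1.0 / p)      # step back to the previous node
        elif x in graph[prev]:
            weights.append(1.0)          # stay close: x is also a neighbor of prev
        else:
            weights.append(1.0 / q)      # move outward, away from prev
    return random.choices(neighbors, weights=weights, k=1)[0]
```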
dna2vec
• Separate the genome into long non-overlapping DNA fragments.
• Convert long DNA fragments into overlapping variable-length k-mers.
• Train embeddings of each k-mer using the Gensim implementation of SkipGram.
• Summing embeddings is related to concatenating k-mers.
• Cosine similarity of k-mer embeddings reproduces a biologically motivated similarity score (Needleman-Wunsch) that is used to align nucleotide sequences.
Patrick Ng, "dna2vec: Consistent vector representations of variable-length k-mers", arXiv:1701.06279 (2017).
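A minimal sketch of the k-mer pipeline described above using gensim's skip-gram; for simplicity this uses a fixed k, whereas dna2vec samples variable-length k-mers (fragments and parameters are illustrative):

```python
from gensim.models import Word2Vec

def to_kmers(fragment, k):
    """Slide a window of width k over a DNA fragment to get overlapping k-mers."""
    return [fragment[i:i + k] for i in range(len(fragment) - k + 1)]

fragments = ["ATGGCATTCGATTGCA", "CACGATTGCATGGCAT"]   # long, non-overlapping fragments
sentences = [to_kmers(f, k=3) for f in fragments]      # each fragment becomes a "sentence"

model = Word2Vec(sentences, vector_size=32, window=4, sg=1, min_count=1)
print(model.wv.similarity("ATG", "TGG"))                # cosine similarity of two k-mers
```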
http://tinyletter.com/dataforscience

More Related Content

What's hot

What's hot (20)

Science in text mining
Science in text miningScience in text mining
Science in text mining
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
word2vec - From theory to practice
word2vec - From theory to practiceword2vec - From theory to practice
word2vec - From theory to practice
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
Word2 vec
Word2 vecWord2 vec
Word2 vec
 
Thai Word Embedding with Tensorflow
Thai Word Embedding with Tensorflow Thai Word Embedding with Tensorflow
Thai Word Embedding with Tensorflow
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 

Viewers also liked

Viewers also liked (7)

Twitterology - The Science of Twitter
Twitterology - The Science of TwitterTwitterology - The Science of Twitter
Twitterology - The Science of Twitter
 
A practical Introduction to Machine(s) Learning
A practical Introduction to Machine(s) LearningA practical Introduction to Machine(s) Learning
A practical Introduction to Machine(s) Learning
 
Mining Geo-referenced Data: Location-based Services and the Sharing Economy
Mining Geo-referenced Data: Location-based Services and the Sharing EconomyMining Geo-referenced Data: Location-based Services and the Sharing Economy
Mining Geo-referenced Data: Location-based Services and the Sharing Economy
 
Human Mobility (with Mobile Devices)
Human Mobility (with Mobile Devices)Human Mobility (with Mobile Devices)
Human Mobility (with Mobile Devices)
 
Mining Georeferenced Data
Mining Georeferenced DataMining Georeferenced Data
Mining Georeferenced Data
 
Complenet 2017
Complenet 2017Complenet 2017
Complenet 2017
 
Machine(s) Learning with Neural Networks
Machine(s) Learning with Neural NetworksMachine(s) Learning with Neural Networks
Machine(s) Learning with Neural Networks
 

Similar to Word2vec and Friends

02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
Subhas Kumar Ghosh
 
[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data science
Nikhil Jaiswal
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data science
Nikhil Jaiswal
 
[Emnlp] what is glo ve part iii - towards data science
[Emnlp] what is glo ve  part iii - towards data science[Emnlp] what is glo ve  part iii - towards data science
[Emnlp] what is glo ve part iii - towards data science
Nikhil Jaiswal
 

Similar to Word2vec and Friends (20)

A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligence
 
Generating Natural-Language Text with Neural Networks
Generating Natural-Language Text with Neural NetworksGenerating Natural-Language Text with Neural Networks
Generating Natural-Language Text with Neural Networks
 
NLP_KASHK:N-Grams
NLP_KASHK:N-GramsNLP_KASHK:N-Grams
NLP_KASHK:N-Grams
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
 
[Emnlp] what is glo ve part ii - towards data science
[Emnlp] what is glo ve  part ii - towards data science[Emnlp] what is glo ve  part ii - towards data science
[Emnlp] what is glo ve part ii - towards data science
 
CNN for modeling sentence
CNN for modeling sentenceCNN for modeling sentence
CNN for modeling sentence
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data science
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
wordembedding.pptx
wordembedding.pptxwordembedding.pptx
wordembedding.pptx
 
Word vectors
Word vectorsWord vectors
Word vectors
 
Word embedding
Word embedding Word embedding
Word embedding
 
probability.pptx
probability.pptxprobability.pptx
probability.pptx
 
Dataworkz odsc london 2018
Dataworkz odsc london 2018Dataworkz odsc london 2018
Dataworkz odsc london 2018
 
[Emnlp] what is glo ve part iii - towards data science
[Emnlp] what is glo ve  part iii - towards data science[Emnlp] what is glo ve  part iii - towards data science
[Emnlp] what is glo ve part iii - towards data science
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
Word_Embeddings.pptx
Word_Embeddings.pptxWord_Embeddings.pptx
Word_Embeddings.pptx
 

Recently uploaded

Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 

Recently uploaded (20)

Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 

Word2vec and Friends

  • 2. www.bgoncalves.com@bgoncalves • Computers are really good at crunching numbers but not so much when it comes to words. • Perhaps can we represent words numerically?
 
 • Can we do it in a way that preserves semantic information?
 
 
 • Words that have similar meanings are used in similar contexts and the context in which a word is used helps us understand it’s meaning. Teaching machines to read! The red house is beautiful.
 The blue house is old.
 The red car is beautiful.
 The blue car is old. “You shall know a word by the company it keeps”
 (J. R. Firth) a 1 about 2 above 3 after 4 again 5 against 6 all 7 am 8 an 9 and 10 any 11 are 12 aren't 13 as 14 … … vafter = (0, 0, 0, 1, 0, 0, · · · ) T vabove = (0, 0, 1, 0, 0, 0, · · · ) T One-hot encoding
  • 3. www.bgoncalves.com@bgoncalves Teaching machines to read! ➡Words with similar meanings should have similar representations. ➡From a word we can get some idea about the context where it might appear 
 
 ➡And from the context we have some idea about possible words “You shall know a word by the company it keeps”
 (J. R. Firth) max p (C|w) max p (w|C) The red _____ is beautiful.
 The blue _____ is old. ___ ___ house __ ____.
 ___ ___ car __ _______.

  • 4. www.bgoncalves.com@bgoncalves word2vec max p (C|w) max p (w|C) Skipgram Continuous Bag of Words ⇥1 wj1wj+1 wj ⇥2 ⇥2 wj+1 ⇥2 ⇥2 ⇥1 wj wj1 Word Context Context Word ⇥1 ⇥2 wj word embeddings context embeddings one hot vector activation function Mikolov 2013
  • 5. www.bgoncalves.com@bgoncalves • Let us take a better look at a simplified case with a single context word • Words are one-hot encoded vectors of length V • is an matrix so that when we take the product: • We are effectively selecting the j’th column of : • The linear activation function simply passes this value along
 which is then multiplied by , a matrix. • Each element k of the output layer its then given by: • We convert these values to a normalized probability distribution by using the softmax Skipgram ⇥1 softmax wj ⇥2 wj+1 wj = (0, 0, 1, 0, 0, 0, · · · ) T ⇥1 (M ⇥ V ) ⇥1 · wj ⇥1 vj = ⇥1 · wj ⇥2 (V ⇥ M) uT k · vj
  • 6. www.bgoncalves.com@bgoncalves • A standard way of converting a set of number to a normalized probability distribution:
 
 • With this final ingredient we obtain:
 
 • Our goal is then to learn: • so that we can predict what the next word is likely to be using:
 • But how can we quantify how far we are from the correct answer? Our error measure shouldn’t be just binary (right or wrong)… Softmax ⇥1 softmax wj ⇥2 wj+1 softmax (x) = exp (xj) P l exp (xl) p (wk|wj) ⌘ softmax uT k · vj = exp uT k · vj P l exp uT l · vj p (wj+1|wj) ⇥1 ⇥2
  • 7. www.bgoncalves.com@bgoncalves Cross-Entropy • First we have to recall that what we are, in effect, comparing two probability distributions: • and the one-hot encoding of the context: • The Cross Entropy measures the distance, in number of bits, between two probability distributions p and q: 
 • In our case, this becomes: • So it’s clear that the only non zero term is the one that corresponds to the “hot” element of 
 • This is our Error function. But how can we use this to update the values of and ? p (wk|wj) H (p, q) = X k pk log qk H = log p (wj+1|wj) wj+1 wj+1 = (0, 0, 0, 1, 0, 0, · · · ) T H [wj+1, p (wk|wj)] = X k wk j+1 log p (wk|wj) ⇥1 ⇥2
  • 8. www.bgoncalves.com@bgoncalves Gradient Descent • Find the gradient for each training batch • Take a step downhill along the direction of the gradient 
 
 • where is the step size. • Repeat until “convergence”. H ✓mn ✓mn ↵ @H @✓mn @H @✓mn ↵
  • 9. www.bgoncalves.com@bgoncalves Chain-rule • How can we calculate
 • we rewrite:
 • and expand: • Then we can rewrite:
 • and apply the chain rule: @H @✓mn = @ @✓mn log p (wj+1|wj) @H @✓mn = @ @✓mn log exp uT k · vj P l exp uT l · vj uT k · vj = X q ✓ (2) kq ✓ (1) qj @f (g (x)) @x = @f (g (x)) @g (x) @g (x) @x @H @✓mn = @ @✓mn " uT k · vj log X l exp uT l · vj # ✓mn = n ✓(1) mn, ✓(2) mn o
  • 10. www.bgoncalves.com@bgoncalves SkipGram with Larger Contexts • Use the same for all context words. • Use the average of cross entropy. • word order is not important (the average does not change). ⇥1 softmax wj ⇥2 wj+1 ⇥1 wj1wj+1 wj ⇥2 ⇥2 H = log p (wj+1|wj) H = 1 T X t log p (wj+t|wj) ⇥2
  • 11. www.bgoncalves.com@bgoncalves Continuous Bag of Words • The process is essentially the same wj+1 ⇥2 ⇥2 ⇥1 wj wj1
  • 12. www.bgoncalves.com@bgoncalves Variations • Hierarchical Softmax: • Approximate the softmax using a binary tree • Reduce the number of calculations per training example from to and increase performance by orders of magnitude. • Negative Sampling: • Under sample the most frequent words by removing them from the text before generating the contexts • Similar idea to removing stop-words — very frequent words are less informative. • Effectively makes the window larger, increasing the amount of information available for context V log2 V
  • 13. www.bgoncalves.com@bgoncalves Comments • word2vec, even in its original formulation is actually a family of algorithms using various combinations of: • Skip-gram, CBOW • Hierarchical Softmax, Negative Sampling • The output of this neural network is deterministic: • If two words appear in the same context (“blue” vs “red”, for e.g.), they will have similar internal representations in 
 and • and are vector embeddings of the input words and the context words respectively • Words that are too rare are also removed. • The original implementation had a dynamic window size: • for each word in the corpus a window size k’ is sampled uniformly between 1 and k ⇥1 wj1wj+1 wj ⇥2 ⇥2 ⇥1 ⇥2 ⇥1 ⇥2
  • 14. www.bgoncalves.com@bgoncalves Online resources • C - https://code.google.com/archive/p/word2vec/ (the original one) • Python/tensorflow - https://www.tensorflow.org/tutorials/word2vec • Both a minimalist and an efficient versions are available in the tutorial • Python/gensim - https://radimrehurek.com/gensim/models/word2vec.html • Pretrained embeddings: • 90 languages, trained using wikipedia: https://github.com/facebookresearch/fastText/blob/ master/pretrained-vectors.md
  • 15. www.bgoncalves.com@bgoncalves Analogies • The embedding of each word is a function of the context it appears in:
 • words that appear in similar contexts will have similar embeddings: • “Distributional hypotesis” in linguistics (red) = f (context (red)) context (red) ⇡ context (blue) =) (red) ⇡ (blue) “You shall know a word by the company it keeps”
 (J. R. Firth) Geometrical relations between contexts imply semantic relations between words! Paris Rome Washington DC Lisbon France PortugalItaly USA Capital context Country context (France) (Paris) + (Rome) = (Italy)
  • 17. www.bgoncalves.com@bgoncalves Linguistic Change • Train word embeddings for different years using Google Books • Independently trained embeddings differ by an arbitrary rotation • Align the different embeddings for different years • Track the way in which the meaning of words shifted over time! Statistically Significant Detection of Linguistic Change Vivek Kulkarni Stony Brook University, USA vvkulkarni@cs.stonybrook.edu Rami Al-Rfou Stony Brook University, USA ralrfou@cs.stonybrook.edu Bryan Perozzi Stony Brook University, USA bperozzi@cs.stonybrook.edu Steven Skiena Stony Brook University, USA skiena@cs.stonybrook.edu ABSTRACT We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the mean- ing and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algo- rithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector represen- tations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time. We demonstrate that our approach is scalable by track- ing linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval Keywords Web Mining;Computational Linguistics 1. INTRODUCTION Natural languages are inherently dynamic, evolving over time to accommodate the needs of their speakers. This e↵ect is especially prevalent on the Internet, where the rapid exchange of ideas can change a word’s meaning overnight. Figure 1: A 2-dimensional projection of the latent seman- tic space captured by our algorithm. Notice the semantic trajectory of the word gay transitioning meaning in the space. In this paper, we study the problem of detecting such linguistic shifts on a variety of media including micro-blog posts, product reviews, and books. Specifically, we seek to detect the broadening and narrowing of semantic senses of words, as they continually change throughout the lifetime of a medium. We propose the first computational approach for track- ing and detecting statistically significant linguistic shifts of words. To model the temporal evolution of natural language, we construct a time series per word. We investigate three methods to build our word time series. First, we extract Frequency based statistics to capture sudden changes in word usage. Second, we construct Syntactic time series by ana- lyzing each word’s part of speech (POS) tag distribution. Finally, we infer contextual cues from word co-occurrence statistics to construct Distributional time series. 
In order to detect and establish statistical significance of word changes over time, we present a change point detection algorithm, which is compatible with all methods. Figure 1 illustrates a 2-dimensional projection of the latent semantic space captured by our Distributional method. We clearly observe the sequence of semantic shifts that the word gay has undergone over the last century (1900-2005). Ini- tially, gay was an adjective that meant cheerful or dapper. WWW’15, 625 (2015)Statistically Significant Detection of Linguistic Change Vivek Kulkarni Stony Brook University, USA vvkulkarni@cs.stonybrook.edu Rami Al-Rfou Stony Brook University, USA ralrfou@cs.stonybrook.edu Bryan Perozzi Stony Brook University, USA bperozzi@cs.stonybrook.edu Steven Skiena Stony Brook University, USA skiena@cs.stonybrook.edu ABSTRACT We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the mean- ing and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algo- rithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector represen- tations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time. We demonstrate that our approach is scalable by track- ing linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using talkative profligate courageous apparitional dapper sublimely unembarrassed courteous sorcerers metonymy religious adolescents philanthropist illiterate transgendered artisans healthy gays homosexual transgender lesbian statesman hispanic uneducated gay1900 gay1950 gay1975 gay1990 gay2005 cheerful Figure 1: A 2-dimensional projection of the latent seman- tic space captured by our algorithm. Notice the semantic trajectory of the word gay transitioning meaning in the space. In this paper, we study the problem of detecting such linguistic shifts on a variety of media including micro-blog posts, product reviews, and books. Specifically, we seek to
  • 18. www.bgoncalves.com@bgoncalves node2vec
• You can turn a sequence of words into a graph by assigning a node to each word and connecting each word to the other words in its neighborhood with edges.
• With this representation, a piece of text is a walk through the network. Then perhaps we can invert the process: use walks through a network to generate sequences of nodes that can be used to train node embeddings (see the sketch below)?
• Node embeddings should capture features of the network structure and allow us to detect similarities between nodes.
Grover & Leskovec, "node2vec: Scalable Feature Learning for Networks", KDD'16, 855 (2016); arXiv:1607.00653
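A minimal sketch of that inversion under the assumptions above: uniform random walks over a graph play the role of sentences and node ids play the role of words, which are then fed to the Gensim Word2Vec implementation (gensim 4.x argument names). The uniform walk is a simplification; the biased walk actually used by node2vec is sketched further below.

import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, length=40):
    # Uniform random walk, returning node ids as strings ("words").
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]

graph = nx.karate_club_graph()                       # small example network
walks = [random_walk(graph, node) for node in graph.nodes() for _ in range(10)]

# Each walk is a "sentence", each node a "word"; sg=1 selects the skip-gram model.
model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=0, workers=2)
print(model.wv.most_similar("0"))                    # nodes embedded close to node 0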
  • 19. www.bgoncalves.com@bgoncalves node2vec
• The features depend strongly on the way in which the network is traversed
• Generate the contexts for each node using Breadth First Search (BFS) and Depth First Search (DFS)
 
 
 
 • Perform a biased Random Walk
• BFS - Explores only limited neighborhoods. Suitable for capturing structural equivalences
• DFS - Freely explores larger neighborhoods and covers homophilous communities
• By tuning the return parameter p and the in-out parameter q, the model can interpolate between the BFS and DFS extremes (a sketch of the biased step follows below)
[Figure 1: BFS and DFS search strategies from node u (k = 3). Figure 2: Illustration of the random walk procedure in node2vec, with transition weights α = 1, 1/p and 1/q for candidate nodes depending on their distance from the previously visited node t]
KDD'16, 855 (2016)
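A minimal sketch of the biased step, following the weighting named in the figure above: having just moved from node t to node v, a candidate next node gets an unnormalized weight of 1/p if it is t itself, 1 if it is a neighbor of t, and 1/q otherwise. Function and variable names are illustrative.

import random
import networkx as nx

def biased_step(graph, prev, curr, p=1.0, q=1.0):
    # Choose the next node of a walk that just moved prev -> curr.
    neighbors = list(graph.neighbors(curr))
    weights = []
    for candidate in neighbors:
        if candidate == prev:                    # step back to the previous node
            weights.append(1.0 / p)
        elif graph.has_edge(candidate, prev):    # stays close to prev (BFS-like)
            weights.append(1.0)
        else:                                    # moves further away (DFS-like)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

def biased_walk(graph, start, length=40, p=1.0, q=1.0):
    walk = [start, random.choice(list(graph.neighbors(start)))]
    while len(walk) < length:
        walk.append(biased_step(graph, walk[-2], walk[-1], p, q))
    return walk

graph = nx.karate_club_graph()
print(biased_walk(graph, 0, length=10, p=0.25, q=4.0))   # low p, high q: local, BFS-like walk

These biased walks would simply replace the uniform ones in the training sketch shown earlier.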
  • 20. www.bgoncalves.com@bgoncalves dna2vec
• Separate the genome into long non-overlapping DNA fragments.
• Convert the long DNA fragments into overlapping variable-length k-mers.
• Train embeddings of each k-mer using the Gensim implementation of SkipGram (see the sketch below).
• Summing embeddings is related to concatenating k-mers.
• Cosine similarity of k-mer embeddings correlates with a biologically motivated similarity score (Needleman-Wunsch) that is used to align nucleotide sequences.
Ng, "dna2vec: Consistent vector representations of variable-length k-mers", arXiv: 1701.06279 (2017)
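A minimal sketch of the k-mer pipeline described above, assuming the Gensim Word2Vec implementation with sg=1 (skip-gram). The toy fragment corpus, the k-mer length range and the particular overlapping-sampling scheme are illustrative placeholders, not the settings used in the paper.

import random
from gensim.models import Word2Vec

def fragment_to_kmers(fragment, k_min=3, k_max=8):
    # Slide over one DNA fragment and emit overlapping k-mers of random length.
    kmers = []
    for i in range(len(fragment)):
        k = random.randint(k_min, k_max)
        if i + k <= len(fragment):
            kmers.append(fragment[i:i + k])
    return kmers

# Placeholder corpus; in practice these would be long non-overlapping genome fragments.
fragments = ["ATGGCGTACGATCGATCGGCTAGCTAGGATCCATGCAT" * 5 for _ in range(100)]
sentences = [fragment_to_kmers(f) for f in fragments]

# Skip-gram embeddings of k-mers, each fragment's k-mer list acting as one "sentence".
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1, workers=2)
some_kmer = model.wv.index_to_key[0]
print(some_kmer, model.wv.most_similar(some_kmer, topn=3))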