word2vec and friends


www.bgoncalves.com
https://github.com/bmtgoncalves/word2vec-and-friends/
@bgoncalves
• Computers are really good at crunching numbers but not so much when it comes to words.
• Perhaps we can represent words numerically?



Teaching machines to read!
a 1
about 2
above 3
after 4
again 5
against 6
all 7
am 8
an 9
and 10
any 11
are 12
aren't 13
as 14
… …
• Computers are really good at crunching numbers but not so much when it comes to words.
• Perhaps we can represent words numerically?



• Can we do it in a way that preserves semantic information?





• Words that have similar meanings are used in similar contexts, and the context in which a word is used helps us understand its meaning.
Teaching machines to read!
The red house is beautiful.

The blue house is old.

The red car is beautiful.

The blue car is old.
“You shall know a word by the company it keeps”

(J. R. Firth)
One-hot encoding:
v_after = (0, 0, 0, 1, 0, 0, …)^T
v_above = (0, 0, 1, 0, 0, 0, …)^T
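A minimal sketch (the vocabulary and indices are illustrative) of the word-to-index table and the one-hot vectors shown above:

```python
import numpy as np

# Illustrative vocabulary: each distinct word gets an integer index.
vocab = ["a", "about", "above", "after", "again", "against", "all"]
word_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot vector of length V with a 1 at the word's index."""
    v = np.zeros(len(word_index))
    v[word_index[word]] = 1.0
    return v

v_after = one_hot("after")   # (0, 0, 0, 1, 0, 0, 0)
v_above = one_hot("above")   # (0, 0, 1, 0, 0, 0, 0)
```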
Teaching machines to read!
➡Words with similar meanings should have similar representations.
➡From a word we can get some idea about the context where it might appear 



➡And from the context we have some idea about possible words
“You shall know a word by the company it keeps”

(J. R. Firth)
max p (C|w)
max p (w|C)
The red _____ is beautiful.

The blue _____ is old.
___ ___ house __ ____.

___ ___ car __ _______.

word2vec
Skipgram (Word → Context), max p(C|w): w_j → Θ1 → Θ2 → w_{j-1}, w_{j+1}
Continuous Bag of Words (Context → Word), max p(w|C): w_{j-1}, w_{j+1} → Θ2 → Θ1 → w_j
Θ1: word embeddings; Θ2: context embeddings. Inputs and outputs are one-hot vectors, with a linear activation function in the hidden layer.
Mikolov 2013
Skipgram
• Let us take a closer look at a simplified case with a single context word.
• Words are one-hot encoded vectors of length V: w_j = (0, 0, 1, 0, 0, 0, …)^T
• Θ1 is an (M × V) matrix, so that when we take the product Θ1 · w_j we are effectively selecting the j'th column of Θ1: v_j = Θ1 · w_j
• The linear activation function simply passes this value along, which is then multiplied by Θ2, a (V × M) matrix.
• Each element k of the output layer is then given by u_k^T · v_j, where u_k is the k'th row of Θ2.
• We convert these values to a normalized probability distribution by using the softmax.
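A minimal numpy sketch of this forward pass, with illustrative dimensions and random matrices: multiplying Θ1 by a one-hot vector selects a column, and Θ2 then produces one score per vocabulary word.

```python
import numpy as np

V, M = 7, 3                       # vocabulary size and embedding dimension (illustrative)
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(M, V))  # word embeddings, one column per vocabulary word
theta2 = rng.normal(size=(V, M))  # context embeddings, one row u_k per vocabulary word

w_j = np.zeros(V)
w_j[2] = 1.0                      # one-hot encoded input word, here j = 2

v_j = theta1 @ w_j                # selects the j'th column of theta1 (linear activation)
assert np.allclose(v_j, theta1[:, 2])

scores = theta2 @ v_j             # element k is u_k^T . v_j, one score per word
```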
Softmax
• A standard way of converting a set of numbers into a normalized probability distribution:
  softmax(x)_j = exp(x_j) / Σ_l exp(x_l)
• With this final ingredient we obtain:
  p(w_k | w_j) ≡ softmax(u_k^T · v_j) = exp(u_k^T · v_j) / Σ_l exp(u_l^T · v_j)
• Our goal is then to learn Θ1 and Θ2, so that we can predict what the next word is likely to be using p(w_{j+1} | w_j).
• But how can we quantify how far we are from the correct answer? Our error measure shouldn't be just binary (right or wrong)…
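In code, the softmax is a one-liner; the max subtraction below is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(x):
    """Convert a vector of scores into a normalized probability distribution."""
    e = np.exp(x - np.max(x))   # subtract the max to avoid overflow; result is unchanged
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.5])
p = softmax(scores)             # p sums to 1; p[k] plays the role of p(w_k | w_j)
```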
Cross-Entropy
• First we have to recall that we are, in effect, comparing two probability distributions: the model's prediction p(w_k | w_j)
• and the one-hot encoding of the context: w_{j+1} = (0, 0, 0, 1, 0, 0, …)^T
• The Cross Entropy measures the distance, in number of bits, between two probability distributions p and q:
  H(p, q) = −Σ_k p_k log q_k
• In our case, this becomes:
  H[w_{j+1}, p(w_k | w_j)] = −Σ_k w^{j+1}_k log p(w_k | w_j)
• So it's clear that the only non-zero term is the one that corresponds to the "hot" element of w_{j+1}:
  H = −log p(w_{j+1} | w_j)
• This is our Error function. But how can we use this to update the values of Θ1 and Θ2?
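A small sketch of this error function with illustrative numbers; because the target is one-hot, the sum collapses to the single "hot" term:

```python
import numpy as np

def cross_entropy(target_one_hot, predicted):
    """H(p, q) = -sum_k p_k log q_k, with p the one-hot target distribution."""
    return -np.sum(target_one_hot * np.log(predicted))

predicted = np.array([0.1, 0.2, 0.6, 0.1])    # p(w_k | w_j) from the softmax
target = np.array([0.0, 0.0, 1.0, 0.0])       # one-hot encoding of w_{j+1}

H = cross_entropy(target, predicted)
assert np.isclose(H, -np.log(predicted[2]))   # only the "hot" element contributes
```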
Gradient Descent
• Find the gradient of H for each training batch.
• Take a step downhill along the direction of the gradient:
  θ_mn ← θ_mn − α ∂H/∂θ_mn
• where α is the step size.
• Repeat until "convergence".
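The update rule in code form; a sketch where the gradient argument is assumed to be ∂H/∂θ computed on the current batch (a toy quadratic loss stands in for H):

```python
import numpy as np

def gradient_descent_step(theta, gradient, alpha=0.1):
    """theta <- theta - alpha * dH/dtheta for one training batch."""
    return theta - alpha * gradient

# Toy example: H(theta) = sum(theta^2), so dH/dtheta = 2 * theta.
theta = np.array([1.0, -2.0])
for _ in range(100):                  # "repeat until convergence"
    theta = gradient_descent_step(theta, 2 * theta)
print(theta)                          # approaches the minimum at (0, 0)
```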
Chain-rule
• How can we calculate
  ∂H/∂θ_mn = −∂/∂θ_mn log p(w_{j+1} | w_j),   with θ_mn ∈ {θ^(1)_mn, θ^(2)_mn}?
• we rewrite:
  ∂H/∂θ_mn = −∂/∂θ_mn log [ exp(u_k^T · v_j) / Σ_l exp(u_l^T · v_j) ]
• and expand:
  u_k^T · v_j = Σ_q θ^(2)_kq θ^(1)_qj
• Then we can rewrite:
  ∂H/∂θ_mn = −∂/∂θ_mn [ u_k^T · v_j − log Σ_l exp(u_l^T · v_j) ]
• and apply the chain rule:
  ∂f(g(x))/∂x = ∂f(g(x))/∂g(x) · ∂g(x)/∂x
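Carrying the chain rule through (a step not shown on the slide) gives the standard result that the gradient of H with respect to the output scores u_l^T · v_j is softmax(scores) − one_hot(k). A quick finite-difference check on illustrative values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def H(scores, k):
    """Cross-entropy loss when word k is the observed context word."""
    return -np.log(softmax(scores)[k])

scores = np.array([0.5, 1.5, -0.3, 0.2])
k = 1
analytic = softmax(scores) - np.eye(len(scores))[k]   # dH/d(scores)

eps = 1e-6
numeric = np.array([
    (H(scores + eps * np.eye(len(scores))[i], k) -
     H(scores - eps * np.eye(len(scores))[i], k)) / (2 * eps)
    for i in range(len(scores))])
assert np.allclose(analytic, numeric, atol=1e-5)
```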
Training procedures
• online learning - update weights after each case
  - might be useful to update the model as new data is obtained
  - subject to fluctuations
• mini-batch - update weights after a "small" number of cases (see the sketch below)
  - batches should be balanced
  - if the dataset is redundant, the gradient estimated using only a fraction of the data is a good approximation to the full gradient.
• momentum - let the gradient change the velocity of the weight change instead of the value directly
• rmsprop - divide the learning rate for each weight by a running average of "recent" gradients
• learning rate - vary it over the course of the training procedure and use different learning rates for each weight
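A schematic mini-batch loop; `compute_gradient(theta, batch)` is a hypothetical placeholder for the model-specific gradient averaged over the batch:

```python
import numpy as np

def minibatch_sgd(theta, data, compute_gradient, alpha=0.01,
                  batch_size=32, epochs=5, seed=0):
    """Update the weights after each "small" batch of training cases."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)             # shuffle so batches stay balanced
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            theta = theta - alpha * compute_gradient(theta, batch)
    return theta
```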
SkipGram with Larger Contexts
• Use the same Θ2 for all context words.
• Use the average of the cross entropy:
  H = −(1/T) Σ_t log p(w_{j+t} | w_j)   instead of   H = −log p(w_{j+1} | w_j)
• Word order is not important (the average does not change).
• Can essentially be trained one context word at a time.
Continuous Bag of Words
• The process is essentially the same, with the direction reversed: the context words w_{j-1}, w_{j+1} are the inputs and the center word w_j is the output.
Variations
• Hierarchical Softmax:
  • Approximate the softmax using a binary tree
  • Reduce the number of calculations per training example from V to log₂ V and increase performance by orders of magnitude.
• Negative Sampling:
  • Instead of the full softmax, update only the observed word and a small sample of randomly drawn "negative" words for each training example.
• Subsampling of frequent words:
  • Undersample the most frequent words by removing them from the text before generating the contexts
  • Similar idea to removing stop-words — very frequent words are less informative.
  • Effectively makes the window larger, increasing the amount of information available for context.
Comments
• word2vec, even in its original formulation, is actually a family of algorithms using various combinations of:
  • Skip-gram, CBOW
  • Hierarchical Softmax, Negative Sampling
• The output of this neural network is deterministic:
  • If two words appear in the same context ("blue" vs "red", e.g.), they will have similar internal representations in Θ1 and Θ2.
• Θ1 and Θ2 are vector embeddings of the input words and the context words, respectively.
• Words that are too rare are also removed.
• The original implementation had a dynamic window size:
  • for each word in the corpus a window size k' is sampled uniformly between 1 and k.
Online resources
• C - https://code.google.com/archive/p/word2vec/ (the original one)
• Python/tensorflow - https://www.tensorflow.org/tutorials/word2vec
  • Both a minimalist and an efficient version are available in the tutorial
• Python/gensim - https://radimrehurek.com/gensim/models/word2vec.html
• Pretrained embeddings:
  • 30+ languages, https://github.com/Kyubyong/wordvectors
  • 100+ languages trained using Wikipedia: https://sites.google.com/site/rmyeid/projects/polyglot
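For reference, training a Skipgram model with gensim looks roughly like this (keyword names follow gensim 4.x; older releases use `size` instead of `vector_size`):

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per list.
sentences = [["the", "red", "house", "is", "beautiful"],
             ["the", "blue", "house", "is", "old"],
             ["the", "red", "car", "is", "beautiful"],
             ["the", "blue", "car", "is", "old"]]

model = Word2Vec(sentences,
                 vector_size=50,   # embedding dimension M
                 window=2,         # context window size
                 min_count=1,      # keep even rare words in this toy example
                 sg=1)             # 1 = Skipgram, 0 = CBOW

vector = model.wv["house"]                 # the learned word embedding
similar = model.wv.most_similar("house")   # nearest neighbours in the embedding space
```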
Visualization (figure-only slides)
Analogies
• The embedding of each word is a function of the context it appears in:
  v(red) = f(context(red))
• Words that appear in similar contexts will have similar embeddings:
  context(red) ≈ context(blue) ⟹ v(red) ≈ v(blue)
• "Distributional hypothesis" in linguistics
"You shall know a word by the company it keeps" (J. R. Firth)
Geometrical relations between contexts imply semantic relations between words!
(figure: Paris, Rome, Washington DC, Lisbon in a "capital" context; France, Italy, Portugal, USA in a "country" context)
v(France) − v(Paris) + v(Rome) = v(Italy),  i.e.  b⃗ − a⃗ + c⃗ = d⃗
Analogies  https://www.tensorflow.org/tutorials/word2vec
What is the word d that is most similar to b and c and most dissimilar to a?
  b⃗ − a⃗ + c⃗ = d⃗
  d† = argmax_x (b⃗ − a⃗ + c⃗)^T x⃗ / ‖b⃗ − a⃗ + c⃗‖
  d† ∼ argmax_x (b⃗^T x⃗ − a⃗^T x⃗ + c⃗^T x⃗)
https://github.com/bmtgoncalves/word2vec-and-friends/
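A sketch of the analogy computation using cosine similarity; `embeddings` and `word_index` stand in for a trained embedding matrix and vocabulary (the random vectors below just exercise the code):

```python
import numpy as np

def analogy(a, b, c, embeddings, word_index):
    """Return the word x whose embedding is most similar to b - a + c."""
    words = list(word_index)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    target = E[word_index[b]] - E[word_index[a]] + E[word_index[c]]
    scores = E @ (target / np.linalg.norm(target))   # cosine similarities
    for w in (a, b, c):                              # exclude the query words
        scores[word_index[w]] = -np.inf
    return words[int(np.argmax(scores))]

# With trained embeddings, analogy("Paris", "France", "Rome", E, idx) should return "Italy".
idx = {w: i for i, w in enumerate(["Paris", "France", "Rome", "Italy", "Lisbon"])}
E = np.random.default_rng(0).normal(size=(len(idx), 8))
print(analogy("Paris", "France", "Rome", E, idx))
```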
Tensorflow
A diversion…  https://www.tensorflow.org/
• Let's imagine I want to perform these calculations:
  y = f(x)
  z = g(y)
• for some given x.
• To calculate z we must follow a certain sequence of operations.
(diagram: a chain of Assign/Apply operations from x to z)
A diversion…  https://www.tensorflow.org/
• Let's imagine I want to perform these calculations:
  y = f(x)
  z = g(y)
• for some given x.
• To calculate z we must follow a certain sequence of operations,
• which can be shortened if we are interested in just the value of y.
• In TensorFlow, this is called a Computational Graph, and it's the most fundamental concept to understand.
• Data, in the form of tensors, flows through the graph from inputs to outputs.
• TensorFlow is, essentially, a way of defining arbitrary computational graphs in a way that can be automatically distributed and optimized.
(diagram: a chain of Assign/Apply operations from x to z)
• If the graph is built from TensorFlow's base functions, TensorFlow knows how to automatically calculate the respective gradients
• Automatic BackProp
• Graphs can have multiple outputs
• Predictions
• Cost functions
• etc…
Computational Graphs https://www.tensorflow.org/
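A small sketch of the automatic gradients in the TensorFlow 1.x style used throughout these slides (`tf.gradients` adds the backprop nodes to the graph; in TensorFlow 2 the same names live under `tf.compat.v1`):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32)
y = tf.square(x) + 3.0 * x        # a small graph built from base operations

grad = tf.gradients(y, [x])[0]    # TensorFlow adds the nodes that compute dy/dx

with tf.Session() as sess:
    print(sess.run([y, grad], feed_dict={x: 2.0}))   # y = 10.0, dy/dx = 7.0
```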
Sessions  https://www.tensorflow.org/
• After we have defined the computational graph, we can start using it to make calculations.
• All computations must take place within a "session" that supplies the values of all required inputs.
• Which values are required for a specific computation depends on what part of the graph is actually being executed.
• When you request the value of a specific output, TensorFlow determines the specific subgraph that must be executed and the required input values.
• For optimization purposes, it can also execute independent parts of the graph on different devices (CPUs, GPUs, TPUs, etc.) at the same time.
Installation https://www.tensorflow.org/install/
A basic Tensorflow program  https://www.tensorflow.org/  (basic.py)

import tensorflow as tf

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
c = tf.constant(3.)

m = tf.add(x, y)
z = tf.multiply(m, c)

with tf.Session() as sess:
    output = sess.run(z, feed_dict={x: 1., y: 2.})
    print("Output value is:", output)

z = c * (x + y)
(diagram: the placeholders x and y feed an add node; together with the constant c, its result feeds a multiply node)
A basic Tensorflow program  https://www.tensorflow.org/
(the same program as above, now executed: the placeholders receive the inputs x = 1 and y = 2, and the multiply node produces the output 9)
Jupyter Notebook
Linguistic Change
• Train word embeddings for different years using
Google Books
• Independently trained embeddings differ by an
arbitrary rotation
• Align the different embeddings for the different years (a sketch follows below)
• Track the way in which the meaning of words shifted over time!
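The alignment step above can be done with an orthogonal Procrustes fit; a minimal sketch, assuming the two embedding matrices cover the same vocabulary in the same row order:

```python
import numpy as np

def align(source, target):
    """Rotate `source` onto `target`: find the orthogonal R minimizing ||source @ R - target||."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return source @ (u @ vt)   # source embeddings expressed in target's coordinate system

# year_1990, year_2005: (V, M) embedding matrices trained independently on different years
# aligned_1990 = align(year_1990, year_2005)
```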
(screenshot of the paper's first page)
Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, Steven Skiena (Stony Brook University), "Statistically Significant Detection of Linguistic Change", WWW'15, 625 (2015).
Figure 1 of the paper: a 2-dimensional projection of the latent semantic space, showing the semantic trajectory of the word "gay" transitioning meaning over 1900-2005.
node2vec
• You can generate a graph out of a sequence of words by assigning a node to each word and connecting each word to its neighbors through edges.
• With this representation, a piece of text is a walk through the network. Then perhaps we can invert the process: use walks through a network to generate sequences of nodes that can be used to train node embeddings.
• Node embeddings should capture features of the network structure and allow for the detection of similarities between nodes.
(screenshot of the paper's first page)
Aditya Grover, Jure Leskovec (Stanford University), "node2vec: Scalable Feature Learning for Networks", KDD'16, 855 (2016). arXiv:1607.00653
node2vec
• The features depend strongly on the way in which the network is traversed.
• Generate the contexts for each node using Breadth First Search (BFS) and Depth First Search (DFS).
• Perform a biased Random Walk (a sketch of the transition weights follows below).
KDD’16, 855 (2016)
(clipped excerpts from the node2vec paper)
Figure 1 of the paper: BFS and DFS search strategies from node u (k = 3).
Figure 2 of the paper: illustration of the random walk procedure in node2vec; the walk just transitioned from t to v and is now evaluating its next step, with edge biases α = 1/p, α = 1 and α = 1/q.
• BFS - explores only limited neighborhoods. Suitable for structural equivalences.
• DFS - freely explores neighborhoods and captures homophilous communities.
• By modifying the p and q parameters of the model, it can interpolate between the BFS and DFS extremes.
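A sketch of one step of the biased walk described above: given the previous node t and the current node v, neighbors are weighted 1/p (return to t), 1 (stay at distance one from t) or 1/q (move farther away). The toy adjacency-list graph is illustrative.

```python
import random

def node2vec_step(graph, t, v, p=1.0, q=1.0, rng=random.Random(0)):
    """Choose the node that follows the transition t -> v in a node2vec walk."""
    neighbors = list(graph[v])
    weights = []
    for x in neighbors:
        if x == t:                 # going back to the previous node
            weights.append(1.0 / p)
        elif x in graph[t]:        # x is also a neighbor of t (distance 1 from t)
            weights.append(1.0)
        else:                      # x moves farther away from t
            weights.append(1.0 / q)
    return rng.choices(neighbors, weights=weights, k=1)[0]

graph = {"u": {"s1", "s2"}, "s1": {"u", "s2", "s3"}, "s2": {"u", "s1"}, "s3": {"s1"}}
print(node2vec_step(graph, t="u", v="s1", p=2.0, q=0.5))
```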
dna2vec
• Separate the genome into long non-overlapping DNA fragments.
• Convert long DNA fragments into overlapping, variable-length k-mers (see the sketch below).
• Train embeddings of each k-mer using the Gensim implementation of SkipGram.
• Summing embeddings is related to concatenating k-mers.
• Cosine similarity of k-mer embeddings reproduces a biologically motivated similarity score (Needleman-Wunsch) that is used to align nucleotides.
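A sketch of the k-mer preprocessing (the fragment, the k range, and the exact splitting scheme are illustrative); the resulting lists of k-mers can then be fed to gensim's Word2Vec as "sentences":

```python
import random

def to_kmers(fragment, k_low=3, k_high=8, rng=random.Random(0)):
    """Split a DNA fragment into overlapping k-mers with randomly chosen lengths."""
    kmers, i = [], 0
    while i + k_low <= len(fragment):
        k = rng.randint(k_low, k_high)     # variable-length k-mer
        kmers.append(fragment[i:i + k])
        i += 1                             # step of 1 keeps consecutive k-mers overlapping
    return kmers

fragment = "ATGGCATTGCACGA"
print(to_kmers(fragment))                  # overlapping k-mers of varying length
```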
(screenshot of the paper's first page)
Patrick Ng, "dna2vec: Consistent vector representations of variable-length k-mers", arXiv:1701.06279 (2017).
stock2vec
• Apply word2vec to 40 years of stock market data.
• Identify significant semantic similarities between companies working in the same area.
https://medium.com/towards-data-science/stock2vec-from-ml-to-p-e-2e6ba407c24
stock2vec https://medium.com/towards-data-science/stock2vec-from-ml-to-p-e-2e6ba407c24
Thank you!
You can hear me speak more about word2vec in this week's podcast:
Word2vec and Friends