1. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD Embeddings
A non-exhaustive introduction to Word Embeddings
Christian S. Perone
christian.perone@gmail.com
2. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
AGENDA
INTRODUCTION
Philosophy of Language
Vector Space Model
Embeddings
Word Embeddings
Language Modeling
WORD2VEC
Introduction
Semantic Relations
Other properties
WORD MOVERS DISTANCE
Rationale
Model
Results
Q&A
3. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WHO AM I
Christian S. Perone
Machine Learning/Software Engineer
Blog
http://blog.christianperone.com
Open-source projects
https://github.com/perone
Twitter @tarantulae
5. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
PHILOSOPHY OF LANGUAGE
(...) the meaning of a word is its use in the language.
Wittgenstein, Ludwig. Philosophical Investigations, 1953.
8. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
VECTOR SPACE MODEL
Interpreted lato sensu (in the broad sense), a VSM is a space where text is
represented as a vector of numbers instead of its original textual form.
There are many approaches for mapping other spaces into a vector space.
Many advantages follow when the vectors have special properties.
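As a rough illustration of the idea above, the sketch below maps text to vectors of word counts. The function names and the two example documents are hypothetical and chosen only for this example; real systems usually add tokenization rules and weighting schemes.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercase token to a fixed vector index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(document, vocabulary):
    """Represent a document as a vector with one count per vocabulary word."""
    counts = Counter(document.lower().split())
    # Dicts keep insertion order, so iterating the vocabulary follows the indices.
    return [counts.get(tok, 0) for tok in vocabulary]


docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)  # cat, dog, mat, on, sat, the
vectors = [to_count_vector(d, vocab) for d in docs]
print(vectors)
```

Each document becomes a point in a space with one dimension per vocabulary word, which is the starting point that embeddings later compress.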
10. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
EMBEDDINGS
A mapping from a space with one dimension per word to a continuous vector
space with much lower dimensionality: from one mathematical object
to another, while preserving "structure".
Source: Our beloved scikit-learn.
11. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD EMBEDDINGS
[Diagram: words (Cat, sat, mat, on) → Word Model → Word Embedding]
Sparse: cat = [ 0, 1, 0, ... ]
Dense: V(cat) = [ 1.4, -1.3, ... ]
From a sparse representation (usually one-hot encoding) to a
dense representation
Embeddings can arise as a by-product of training another model, or come from a model built explicitly to learn them.
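The move from sparse to dense can be sketched as a table lookup. In the example below the vocabulary order, the embedding dimension and the random matrix values are all assumptions made for illustration; in a trained model the matrix would be learned.

```python
import random

# Hypothetical four-word vocabulary; "cat" sits at index 1 as on the slide.
vocabulary = ["sat", "cat", "mat", "on"]
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}
embedding_dim = 3  # assumption: real models use hundreds of dimensions


def one_hot(word):
    """Sparse representation: a single 1 at the word's index."""
    vec = [0] * len(vocabulary)
    vec[word_to_index[word]] = 1
    return vec


# Stand-in for a learned embedding matrix: one dense row per word.
rng = random.Random(0)
embedding_matrix = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocabulary
]


def embed(word):
    """Dense representation: look up the word's row."""
    return embedding_matrix[word_to_index[word]]


def embed_via_one_hot(word):
    """Multiplying the one-hot vector by the matrix selects the same row."""
    hot = one_hot(word)
    return [
        sum(hot[i] * embedding_matrix[i][j] for i in range(len(vocabulary)))
        for j in range(embedding_dim)
    ]


print(one_hot("cat"))
print(embed("cat"))
```

The lookup and the matrix product give the same vector, which is why an embedding layer is usually implemented as an index into a table rather than as a multiplication.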
12. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
LANGUAGE MODELING
P(w_1, \dots, w_n) = \prod_i P(w_i \mid w_1, \dots, w_{i-1})
P("the cat sat on the mat") > P("the mat sat on the cat")
Useful for many different tasks, such as speech recognition,
handwriting recognition, translation, etc.
Naive counting doesn't generalize: there are too many possible
sentences
A word sequence on which the model will be tested is likely to
be different from all the word sequences seen during training.
[Bengio et al, 2003]
Markov assumption / how to approximate it
13. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
MARKOV ASSUMPTION AND N-GRAM MODELS
The Markov assumption simplifies the model by approximating the
components of the product.
Unigram: P(w_1, \dots, w_n) = \prod_i P(w_i)
Bigram: P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})
Extend to trigram, 4-gram, etc.
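A minimal sketch of a bigram model estimated by counting, assuming a toy two-sentence corpus and hypothetical BOS/EOS boundary markers (neither appears in the slides). It uses plain maximum-likelihood estimates with no smoothing.

```python
from collections import Counter

# Toy training corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]


def tokenize(sentence):
    """Add hypothetical sentence-boundary markers around the words."""
    return ["BOS"] + sentence.split() + ["EOS"]


context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = tokenize(sentence)
    context_counts.update(tokens[:-1])             # every token that has a successor
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs


def bigram_probability(previous, word):
    """Maximum-likelihood estimate of P(word | previous)."""
    if context_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / context_counts[previous]


def sentence_probability(sentence):
    """Product of bigram probabilities along the sentence."""
    tokens = tokenize(sentence)
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        probability *= bigram_probability(previous, word)
    return probability


print(sentence_probability("the cat sat on the mat"))
print(sentence_probability("the mat sat on the cat"))
print(sentence_probability("the cat sat on the rug"))
```

The last sentence never occurs in the corpus, so counting whole sentences would give it probability zero, while the bigram model still assigns it a positive probability because every adjacent pair was seen.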
14. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD EMBEDDINGS
CHARACTERISTICS
Learned through language modeling; low-dimensional and dense, but with increased model complexity.
Examples: neural language models, word2vec, GloVe, etc.
Source: Bengio et al., 2003
Classic neural language model proposed by Bengio et al. in 2003.
After that came many other important works, by Collobert and Weston (2008)
and then by Mikolov et al. (2013).
16. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD2VEC
An unsupervised technique trained through supervised prediction tasks: it
takes a text corpus and produces word embeddings as output. There are two
different architectures:
[Diagram: two architectures, each with INPUT, PROJECTION and OUTPUT layers.
CBOW: inputs w(t-2), w(t-1), w(t+1), w(t+2) → SUM → output w(t).
Skip-gram: input w(t) → outputs w(t-2), w(t-1), w(t+1), w(t+2).]
Figure 2: Graphical representation of the CBOW model and Skip-gram model. In the CBOW model, the distributed
representations of context (or surrounding words) are combined to predict the word in the middle. In the Skip-gram
model, the distributed representation of the input word is used to predict the context.
Source: Exploiting Similarities among Languages for Machine Translation. Mikolov, Tomas et al.
2013.
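The difference between the two architectures shows up in the training examples each one extracts from text. The sketch below only builds those examples; it does not train a network. The helper names and the window size of 2 are assumptions made for illustration.

```python
def context_window(tokens, position, window):
    """Words within `window` positions of `position`, excluding the word itself."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    return [tokens[i] for i in range(start, end) if i != position]


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words together predict the middle word."""
    return [
        (context_window(tokens, pos, window), target)
        for pos, target in enumerate(tokens)
    ]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: the middle word predicts each surrounding word separately."""
    return [
        (center, context)
        for pos, center in enumerate(tokens)
        for context in context_window(tokens, pos, window)
    ]


tokens = "the cat sat on the mat".split()
print(cbow_examples(tokens)[2])
print(skipgram_pairs(tokens)[:4])
```

CBOW produces one example per position with a whole context as input, whereas Skip-gram produces one (input, output) pair per context word.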
18. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
AMAZING EMBEDDINGS
Semantic relationships are often preserved under vector operations.
Source: TensorFlow
21. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD ANALOGIES
Suppose we have the vector w \in \mathbb{R}^n of any given word, such as w_{king}.
Then we can do:
w_{king} - w_{man} + w_{woman} \approx w_{queen}
That is, the closest word vector to the result of this operation is w_{queen}.
This is an amazing property of word embeddings, because it
means that they carry important relational information that can
be used in many different tasks.
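The analogy operation can be sketched with a few hand-made vectors. The three-dimensional values below are invented so that the arithmetic is easy to follow; they do not come from any trained model, where the relation holds only approximately.

```python
import math

# Hypothetical toy vectors: (royalty, masculinity, fruitiness).
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "prince": [0.8, 1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is a common convention, since the result vector often stays close to one of its own inputs.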
22. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
LANGUAGE STRUCTURE
[Figure: four 2-D scatter plots of word vectors.
English numbers: one, two, three, four, five.
Spanish numbers: uno (one), dos (two), tres (three), cuatro (four), cinco (five).
English animals: cat, dog, cow, horse, pig.
Spanish animals: gato (cat), perro (dog), vaca (cow), caballo (horse), cerdo (pig).]
Figure 1: Distributed word vector representations of numbers and animals in English (left) and Spanish (right). The five
vectors in each language were projected down to two dimensions using PCA, and then manually rotated to accentuate
their similarity. It can be seen that these concepts have similar geometric arrangements in both spaces, suggesting that
Source: Exploiting Similarities among Languages for Machine Translation. Mikolov, Tomas et al.
2013.
25. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
DEEP LEARNING ?
Word2vec isn't Deep Learning; the model is actually very shallow.
However, there is an important relation here: word embeddings are
often used to initialize the dense embedding layer of LSTMs
in deep architectures for different tasks.
Also, you can of course train Word2vec models using techniques
developed in the Deep Learning context.
26. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
DEMO
Demo time for some word analogies in Portuguese.
Model trained by Kyubyong Park.
Trained on Wikipedia (pt) - 1.3GB corpus
w \in \mathbb{R}^n, where n is 300
Vocabulary size is 50246
Model available at https://github.com/Kyubyong/wordvectors
For comparison, Wikipedia (en) is 13.5GB
29. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD MOVERS DISTANCE
Word2vec defines a vector for each word, but how can we use that
information to compare documents?
There are many approaches to representing documents, such as
BOW, TF-IDF and N-grams. However, the resulting document vectors are
frequently near-orthogonal.
31. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD MOVERS DISTANCE
Take the two sentences:
"Obama speaks to the media in Illinois"
and
"The President greets the press in Chicago"
While these sentences have no words in common, they convey
nearly the same information, a fact that cannot be represented by
the BOW model (Kusner, Matt J. et al. 2015).
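The orthogonality problem can be checked directly on these two sentences. The sketch below assumes a tiny hand-picked stop-word list (the, to, in), which is not specified in the slides, and compares the resulting bag-of-words vectors with cosine similarity.

```python
import math
from collections import Counter

# Assumption: a minimal stop-word list chosen just for these two sentences.
STOP_WORDS = {"the", "to", "in"}


def bag_of_words(sentence):
    """Count the non-stop words of a sentence."""
    return Counter(w for w in sentence.lower().split() if w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two non-empty word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)  # missing words count as 0
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


s1 = "Obama speaks to the media in Illinois"
s2 = "The President greets the press in Chicago"
print(cosine(bag_of_words(s1), bag_of_words(s2)))  # 0.0
```

With no shared non-stop words the two vectors are exactly orthogonal, even though the sentences say nearly the same thing.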
32. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD MOVERS DISTANCE
[Figure: the non-stop words of document 1 ("Obama speaks to the media in Illinois") and
document 2 ("The President greets the press in Chicago") plotted in the word2vec embedding
space: Obama, speaks, media, Illinois and President, greets, press, Chicago.]
Figure 1. An illustration of the word mover's distance. All
non-stop words (bold) of both documents are embedded into a
Source: From Word Embeddings To Document Distances. Kusner, Matt J. et al. 2015.
33. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD MOVERS DISTANCE
[Figure: word-to-word flows between sentences, labeled with their distance contributions.
D0: The President greets the press in Chicago.
D1: Obama speaks to the media in Illinois. Distance to D0: 0.45 + 0.24 + 0.20 + 0.18 = 1.07
D2: The band gave a concert in Japan. Distance to D0: 0.49 + 0.42 + 0.44 + 0.28 = 1.63
D3: Obama speaks in Illinois. Distance to D0: 1.30]
Figure 2. (Top:) The components of the WMD metric between a
query D0 and two sentences D1, D2 (with equal BOW distance).
The arrows represent flow between two words and are labeled
with their distance contribution. (Bottom:) The flow between two
sentences D3 and D0 with different numbers of words. This mis-
Source: From Word Embeddings To Document Distances. Kusner, Matt J. et al. 2015.
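The flow idea in the figure can be sketched for a restricted case. The code below assumes two documents with the same number of distinct words and uniform word weights; in that case an optimal flow moves each word entirely to one other word, so a brute-force search over one-to-one matchings is enough. The 2-D embeddings are invented for illustration and are not word2vec vectors. The general WMD, including the unequal-length case in the bottom of the figure, requires solving a transportation problem instead.

```python
import math
from itertools import permutations

# Hypothetical 2-D embeddings; related words are placed close together.
embeddings = {
    "obama": [1.0, 1.0], "president": [1.1, 0.9],
    "speaks": [3.0, 1.0], "greets": [3.2, 1.1],
    "media": [5.0, 2.0], "press": [5.1, 2.1],
    "illinois": [7.0, 0.0], "chicago": [7.1, 0.2],
    "band": [0.0, 5.0], "gave": [3.0, 4.0],
    "concert": [5.0, 6.0], "japan": [8.0, 4.0],
}


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def wmd_equal_length(doc_a, doc_b):
    """Word mover's distance for equal-length documents with uniform weights.

    Tries every one-to-one matching, so it is only practical for short documents.
    """
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch only handles documents of equal length")
    best_total = min(
        sum(euclidean(embeddings[a], embeddings[b]) for a, b in zip(doc_a, match))
        for match in permutations(doc_b)
    )
    return best_total / len(doc_a)  # each word carries weight 1/n


d0 = ["president", "greets", "press", "chicago"]
d1 = ["obama", "speaks", "media", "illinois"]
d2 = ["band", "gave", "concert", "japan"]
print(wmd_equal_length(d1, d0))
print(wmd_equal_length(d2, d0))
```

With these toy vectors D1 ends up much closer to D0 than D2 does, mirroring the ordering shown in the figure.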
34. INTRODUCTION WORD2VEC WORD MOVERS DISTANCE Q&A
WORD MOVERS DISTANCE
[Figure: bar chart of k-nearest neighbor test error (%) on the data sets bbcsport, twitter,
recipe, ohsumed, classic, reuters, amazon and 20news, for the following methods:]
BOW [Frakes & Baeza-Yates, 1992]
TF-IDF [Jones, 1972]
Okapi BM25 [Robertson & Walker, 1994]
LSI [Deerwester et al., 1990]
LDA [Blei et al., 2003]
mSDA [Chen et al., 2012]
Componential Counting Grid [Perina et al., 2013]
Word Mover's Distance
Figure 3. The kNN test error results on 8 document classification data sets, compared to canonical and state-of-the-art baseline methods.
[Figure: bar chart of average error w.r.t. BOW for BOW, TF-IDF, Okapi BM25, LSI, LDA, mSDA, CCG and WMD.]
Figure 4. The kNN test errors of various document metrics averaged
over all eight datasets, relative to kNN with BOW.
Table 2. Test error percentage and standard deviation for different
text embeddings. NIPS, AMZ, News are word2vec (w2v) models
trained on different data sets whereas HLBL and Collo were also
obtained with other embedding algorithms.
DOCUMENT k-NEAREST NEIGHBOR RESULTS
DATASET    HLBL   CW     NIPS (W2V)   AMZ (W2V)   NEWS (W2V)
BBCSPORT    4.5    8.2    9.5          4.1         5.0
TWITTER    33.3   33.7   29.3         28.1        28.3
RECIPE     47.0   51.6   52.7         47.4        45.1
OHSUMED    52.0   56.2   55.6         50.4        44.5
CLASSIC     5.3    5.5    4.0          3.8         3.0
REUTERS     4.2    4.6    7.1          9.1         3.5
AMAZON     12.3   13.3   13.9          7.8         7.2
Source: From Word Embeddings To Document Distances. Kusner, Matt J. et al. 2015.