23 MARCH 2018, LEIDEN UNIVERSITY
WORD2VEC TUTORIAL
SUZAN VERBERNE
CONTENTS
 Introduction
 Background
 What is word2vec?
 What can you do with it?
 Practical session
INTRODUCTORY ROUND
ABOUT ME
 Master's degree (2002): Natural Language Processing (NLP)
 PhD (2010): NLP and Information Retrieval (IR)
 Now: Assistant Professor at the Leiden Institute for Advanced
Computer Science (LIACS), affiliated with the Data Science research
programme.
 Research: projects on text mining and information retrieval in
various domains
 Teaching: Data Science & Text Mining
BACKGROUND
WHERE TO START
 Linguistics: Distributional hypothesis
 Data science: Vector Space Model (VSM)
DISTRIBUTIONAL HYPOTHESIS
 Harris, Z. (1954). “Distributional structure”. Word. 10 (23): 146–162.
 “Words that occur in similar contexts tend to be similar”
VECTOR SPACE MODEL
 Traditional Vector Space Model (Information Retrieval):
 documents and queries represented in a vector space
 where the dimensions are the words
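As an illustration of this traditional VSM, here is a minimal sketch with toy
documents and scikit-learn (an assumption for illustration; the tutorial itself
does not use scikit-learn): each word is a dimension, and documents and queries
become count vectors that can be compared with cosine similarity.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the cat sat on the mat", "dogs and cats are pets"]  # toy corpus
query = ["cat on a mat"]

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)    # one row per document, one column per word
query_vector = vectorizer.transform(query)           # projected into the same word dimensions

print(cosine_similarity(query_vector, doc_vectors))  # similarity of the query to each document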
VECTOR SPACE MODEL
 Vector Space Models represent (embed) words or documents in a
continuous vector space
 The vector space is relatively low-dimensional (100 – 320
dimensions instead of 10,000s)
 Semantically and syntactically similar words are mapped to nearby
points (Distributional Hypothesis)
PCA projection of a 320-dimensional vector space
WORD2VEC
WHAT IS WORD2VEC?
 Word2vec is a particularly computationally efficient predictive
model for learning word embeddings from raw text
 So that similar words are mapped to nearby points
EXAMPLE: PATIENT FORUM MINER
WHERE DOES IT COME FROM?
 Neural network language model (NNLM) (Bengio et al., 2003 – not
discussed today)
 Mikolov previously proposed (MSc thesis, PhD thesis, other papers)
to first learn word vectors “using neural network with a single
hidden layer” and then train the NNLM independently
 Word2vec directly extends this work => “word vectors learned
using a simple model”
 Many architectures and models have been proposed for computing
word vectors (e.g. see Socher’s Stanford group work which resulted
in GloVe - http://nlp.stanford.edu/projects/glove/)
WHAT IS WORD2VEC (NOT)?
 It is:
 Neural network-based
 Word embeddings
 VSM-based
 Co-occurrence-based
 It is not:
 Deep learning
Word embedding is the collective name for a
set of language modeling and feature learning
techniques in natural language processing
(NLP) where words or phrases from the
vocabulary are mapped to vectors of real
numbers in a low-dimensional space relative to
the vocabulary size
WORDS IN CONTEXT
HOW IS THE MODEL TRAINED?
 Move through the training corpus with a sliding window. Each word
is a prediction problem:
 the objective is to predict the current word with the help of its
contexts (or vice versa)
 The outcome of the prediction determines how much we adjust the
current word vector. Gradually, the vectors converge to (hopefully)
optimal values
It is important that prediction here is not an aim in itself: it is just a
proxy for learning vector representations that are good for other tasks
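A minimal sketch of this sliding-window idea (not the gensim internals):
every position in the corpus yields (context word, current word) pairs;
skip-gram predicts the context from the current word, CBOW predicts the
current word from its context.

def training_pairs(tokens, window=2):
    # yield (context word, current word) pairs from a sliding window
    for i, current in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield (tokens[j], current)

sentence = "words that occur in similar contexts tend to be similar".split()
print(list(training_pairs(sentence, window=2)))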
OTHER VECTOR SPACE SEMANTICS TECHNIQUES
 Some "classic" NLP techniques for estimating continuous
representations of words:
 LSA (Latent Semantic Analysis)
 LDA (Latent Dirichlet Allocation)
 Distributed representations of words learned by neural networks
outperform LSA on various tasks
 LDA is computationally expensive and does not scale to very
large datasets
ADVANTAGES OF WORD2VEC
 It scales
 Train on billion-word corpora
 In limited time
 Possibility of parallel training
 Word embeddings trained by one party can be reused by others
 For entirely different tasks
 Incremental training
 Train on one piece of data, save the results, and continue training later (see the sketch below)
 There is a Python module for it:
 Gensim word2vec
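A hedged sketch of saving a model and continuing training with gensim
(the corpus variables and the file name are placeholders, and keyword
arguments may differ slightly between gensim versions):

import gensim

model = gensim.models.Word2Vec(first_sentences, size=100, window=5, min_count=5)
model.save("my_word2vec.model")                            # save the results

model = gensim.models.Word2Vec.load("my_word2vec.model")   # continue later
model.build_vocab(more_sentences, update=True)             # add new vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=5)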
WHAT CAN YOU DO WITH IT?
DO IT YOURSELF
 Implementation in Python package gensim
import gensim
model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=5, workers=4)
size: the dimensionality of the feature vectors (common: 200 or 320)
window: the maximum distance between the current and the predicted word
within a sentence
min_count: minimum number of occurrences of a word in the corpus for it to be
included in the model
workers: for parallelization on multicore machines
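For reference, a minimal sketch of what sentences can look like: an iterable of
tokenized sentences (lists of strings). The toy data and min_count=1 here are
only for illustration.

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]
model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=1, workers=4)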
DO IT YOURSELF
model.most_similar('apple')
 [('banana', 0.8571481704711914), ...]
model.doesnt_match("breakfast cereal dinner lunch".split())
 'cereal'
model.similarity('woman', 'man')
 0.73723527
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 This is the famous example:
vector(king) – vector(man) + vector(woman) = vector(queen)
 Actually, what the original paper says is: if you subtract the vector
for ‘man’ from the one for ‘king’ and add the vector for ‘woman’,
the vector closest to the one you end up with turns out to be the
one for ‘queen’.
 More interesting:
France is to Paris as Germany is to …
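In gensim, the same analogy questions can be asked with most_similar; the
outputs below are only illustrative and depend on the training corpus (and on
whether it keeps capitalised tokens).

model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
# typically returns something like [('queen', ...)]

model.most_similar(positive=['Paris', 'Germany'], negative=['France'], topn=1)
# typically returns something like [('Berlin', ...)]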
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances in
Neural Information Processing Systems, pages 3111–3119, 2013.
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 It also works for syntactic relations:
 vector(biggest) - vector(big) + vector(small) = vector(smallest)
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Improve NLP applications
 Improve Text Mining applications
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Selecting out-of-the-list words
 Example: which word does not belong in [monkey, lion, dog, truck]
 Selectional preferences
 Example: predict typical verb-noun pairs: people as subject of eating is more
likely than people as object of eating
 Improve NLP applications:
 Sentence completion/text prediction
 Bilingual Word Embeddings for Phrase-Based Machine Translation
WHAT CAN YOU DO WITH IT?
 Improve Text Mining applications:
 (Near-)Synonym detection (→ query expansion)
 Concept representation of texts
 Example: Twitter sentiment classification
 Document similarity
 Example: cluster news articles per news event
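One simple way to build such a concept-level document representation (a hedged
sketch; averaging word vectors is a common baseline, not necessarily the method
used in the examples above): average the vectors of a document's tokens and
compare documents with cosine similarity.

import numpy as np

def document_vector(model, tokens):
    # average the vectors of the in-vocabulary tokens
    # (with gensim 3.x; in newer versions use model.wv instead of model)
    vectors = [model[word] for word in tokens if word in model]
    return np.mean(vectors, axis=0)

def document_similarity(model, tokens_a, tokens_b):
    a, b = document_vector(model, tokens_a), document_vector(model, tokens_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))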
DEAL WITH PHRASES
 Entities as single words:
 during pre-processing: find collocations (e.g. Wikipedia anchor texts),
and feed them as single ‘words’ to the neural network.
 Bigrams (built-in function)
bigram = gensim.models.Phrases(sentence_stream)
term_list = list(bigram[sentence_stream])
model = gensim.models.Word2Vec(term_list)
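A hedged illustration of what Phrases does (toy data; the min_count and
threshold settings usually need tuning on real corpora): frequent word pairs
are merged into single tokens joined by an underscore.

sentence_stream = [["new", "york", "is", "big"],
                   ["i", "love", "new", "york"]] * 50   # repeated so the pair is frequent
bigram = gensim.models.Phrases(sentence_stream, min_count=2, threshold=0.01)
print(bigram[["i", "visited", "new", "york"]])
# e.g. ['i', 'visited', 'new_york'] (exact behaviour depends on the settings)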
HOW TO EVALUATE A MODEL
 There is a comprehensive test set that contains five types of
semantic questions and nine types of syntactic questions:
 8869 semantic questions
 10675 syntactic questions
 https://rare-technologies.com/word2vec-tutorial/
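A hedged sketch of running that test set with gensim (the questions-words.txt
file has to be downloaded separately; newer gensim versions use
evaluate_word_analogies instead of accuracy):

results = model.accuracy("questions-words.txt")
for section in results:
    correct, incorrect = len(section["correct"]), len(section["incorrect"])
    if correct + incorrect > 0:
        print(section["section"], correct / (correct + incorrect))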
FURTHER READING
 T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems, pages 3111–3119,
2013.
 http://radimrehurek.com/2014/02/word2vec-tutorial/
 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
 https://rare-technologies.com/word2vec-tutorial/
PRACTICAL SESSION
PRE-TRAINED WORD2VEC MODELS
 We will compare two pre-trained word2vec models:
 A model trained on a Dutch web corpus (COW)
 A model trained on Dutch web forum data
 Goal: inspect the differences between the models
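A hedged sketch of how the comparison could be set up (the file names are
placeholders, and whether the models are distributed in binary word2vec format
is an assumption):

from gensim.models import KeyedVectors

cow_model = KeyedVectors.load_word2vec_format("cow_model.bin", binary=True)
forum_model = KeyedVectors.load_word2vec_format("forum_model.bin", binary=True)

word = "ziekenhuis"   # Dutch for 'hospital'; pick any word in both vocabularies
print(cow_model.most_similar(word, topn=5))
print(forum_model.most_similar(word, topn=5))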
THANKS
 Traian Rebedea
 Tom Kenter
 Mohammad Mahdavi
 Satoshi Sekine