23 MARCH 2018, LEIDEN UNIVERSITY
WORD2VEC TUTORIAL
SUZAN VERBERNE
CONTENTS
 Introduction
 Background
 What is word2vec?
 What can you do with it?
 Practical session
INTRODUCTORY ROUND
ABOUT ME
 Master's degree (2002): Natural Language Processing (NLP)
 PhD (2010): NLP and Information Retrieval (IR)
 Now: Assistant Professor at the Leiden Institute for Advanced
Computer Science (LIACS), affiliated with the Data Science research
programme.
 Research: projects on text mining and information retrieval in
various domains
 Teaching: Data Science & Text Mining
BACKGROUND
WHERE TO START
 Linguistics: Distributional hypothesis
 Data science: Vector Space Model (VSM)
DISTRIBUTIONAL HYPOTHESIS
 Harris, Z. (1954). “Distributional structure”. Word. 10 (23): 146–162.
 “Words that occur in similar contexts tend to be similar”
VECTOR SPACE MODEL
 Traditional Vector Space Model (Information Retrieval):
 documents and queries represented in a vector space
 where the dimensions are the words
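As an illustration of this traditional VSM, here is a minimal sketch with toy
documents and scikit-learn (an assumption for illustration; the tutorial itself
does not use scikit-learn): each word is a dimension, and documents and queries
become count vectors that can be compared with cosine similarity.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the cat sat on the mat", "dogs and cats are pets"]  # toy corpus
query = ["cat on a mat"]

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)    # one row per document, one column per word
query_vector = vectorizer.transform(query)           # projected into the same word dimensions

print(cosine_similarity(query_vector, doc_vectors))  # similarity of the query to each document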
VECTOR SPACE MODEL
 Vector Space Models represent (embed) words or documents in a
continuous vector space
 The vector space is relatively low-dimensional (100 – 320
dimensions instead of 10,000s)
 Semantically and syntactically similar words are mapped to nearby
points (Distributional Hypothesis)
PCA projection of a 320-dimensional vector space
WORD2VEC
WHAT IS WORD2VEC?
 Word2vec is a particularly computationally efficient predictive
model for learning word embeddings from raw text
 So that similar words are mapped to nearby points
EXAMPLE: PATIENT FORUM MINER
WHERE DOES IT COME FROM?
 Neural network language model (NNLM) (Bengio et al., 2003 – not
discussed today)
 Mikolov previously proposed (MSc thesis, PhD thesis, other papers)
to first learn word vectors “using neural network with a single
hidden layer” and then train the NNLM independently
 Word2vec directly extends this work => “word vectors learned
using a simple model”
 Many architectures and models have been proposed for computing
word vectors (e.g. see Socher’s Stanford group work which resulted
in GloVe - http://nlp.stanford.edu/projects/glove/)
WHAT IS WORD2VEC (NOT)?
 It is:
 Neural network-based
 Word embeddings
 VSM-based
 Co-occurrence-based
 It is not:
 Deep learning
Word embedding is the collective name for a
set of language modeling and feature learning
techniques in natural language processing
(NLP) where words or phrases from the
vocabulary are mapped to vectors of real
numbers in a low-dimensional space relative to
the vocabulary size
WORDS IN CONTEXT
HOW IS THE MODEL TRAINED?
 Move through the training corpus with a sliding window. Each word
is a prediction problem:
 the objective is to predict the current word with the help of its
contexts (or vice versa)
 The outcome of the prediction determines how much we adjust the
current word vector. Gradually, the vectors converge to (hopefully)
optimal values
It is important that prediction here is not an aim in itself: it is just a
proxy for learning vector representations that are good for other tasks
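A minimal sketch of this sliding-window idea (not the gensim internals):
every position in the corpus yields (context word, current word) pairs;
skip-gram predicts the context from the current word, CBOW predicts the
current word from its context.

def training_pairs(tokens, window=2):
    # yield (context word, current word) pairs from a sliding window
    for i, current in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield (tokens[j], current)

sentence = "words that occur in similar contexts tend to be similar".split()
print(list(training_pairs(sentence, window=2)))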
OTHER VECTOR SPACE SEMANTICS TECHNIQUES
 Some "classic" NLP techniques for estimating continuous
representations of words:
 LSA (Latent Semantic Analysis)
 LDA (Latent Dirichlet Allocation)
 Distributed representations of words learned by neural networks
outperform LSA on various tasks
 LDA is computationally expensive and does not scale to very
large datasets
ADVANTAGES OF WORD2VEC
 It scales
 Train on billion-word corpora
 In limited time
 Possibility of parallel training
 Word embeddings trained by one party can be reused by others
 For entirely different tasks
 Incremental training
 Train on one piece of data, save the results, and continue training later (see the sketch below)
 There is a Python module for it:
 Gensim word2vec
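A hedged sketch of saving a model and continuing training with gensim
(the corpus variables and the file name are placeholders, and keyword
arguments may differ slightly between gensim versions):

import gensim

model = gensim.models.Word2Vec(first_sentences, size=100, window=5, min_count=5)
model.save("my_word2vec.model")                            # save the results

model = gensim.models.Word2Vec.load("my_word2vec.model")   # continue later
model.build_vocab(more_sentences, update=True)             # add new vocabulary
model.train(more_sentences, total_examples=model.corpus_count, epochs=5)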
WHAT CAN YOU DO WITH IT?
DO IT YOURSELF
 Implementation in Python package gensim
import gensim
model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=5, workers=4)
size: the dimensionality of the feature vectors (common: 200 or 320)
window: the maximum distance between the current and the predicted word
within a sentence
min_count: minimum number of occurrences of a word in the corpus for it to be
included in the model
workers: for parallelization on multicore machines
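For reference, a minimal sketch of what sentences can look like: an iterable of
tokenized sentences (lists of strings). The toy data and min_count=1 here are
only for illustration.

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]
model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=1, workers=4)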
DO IT YOURSELF
model.most_similar('apple')
 [('banana', 0.8571481704711914), ...]
model.doesnt_match("breakfast cereal dinner lunch".split())
 'cereal'
model.similarity('woman', 'man')
 0.73723527
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 This is the famous example:
vector(king) – vector(man) + vector(woman) = vector(queen)
 Actually, what the original paper says is: if you subtract the vector
for ‘man’ from the one for ‘king’ and add the vector for ‘woman’,
the vector closest to the one you end up with turns out to be the
one for ‘queen’.
 More interesting:
France is to Paris as Germany is to …
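In gensim, the same analogy questions can be asked with most_similar; the
outputs below are only illustrative and depend on the training corpus (and on
whether it keeps capitalised tokens).

model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
# typically returns something like [('queen', ...)]

model.most_similar(positive=['Paris', 'Germany'], negative=['France'], topn=1)
# typically returns something like [('Berlin', ...)]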
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances in
Neural Information Processing Systems, pages 3111–3119, 2013.
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 It also works for syntactic relations:
 vector(biggest) - vector(big) + vector(small) = vector(smallest)
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Improve NLP applications
 Improve Text Mining applications
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Selecting out-of-the-list words
 Example: which word does not belong in [monkey, lion, dog, truck]
 Selectional preferences
 Example: predict typical verb-noun pairs: people as subject of eating is more
likely than people as object of eating
 Improve NLP applications:
 Sentence completion/text prediction
 Bilingual Word Embeddings for Phrase-Based Machine Translation
WHAT CAN YOU DO WITH IT?
 Improve Text Mining applications:
 (Near-)Synonym detection (→ query expansion)
 Concept representation of texts
 Example: Twitter sentiment classification
 Document similarity
 Example: cluster news articles per news event
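One simple way to build such a concept-level document representation (a hedged
sketch; averaging word vectors is a common baseline, not necessarily the method
used in the examples above): average the vectors of a document's tokens and
compare documents with cosine similarity.

import numpy as np

def document_vector(model, tokens):
    # average the vectors of the in-vocabulary tokens
    # (with gensim 3.x; in newer versions use model.wv instead of model)
    vectors = [model[word] for word in tokens if word in model]
    return np.mean(vectors, axis=0)

def document_similarity(model, tokens_a, tokens_b):
    a, b = document_vector(model, tokens_a), document_vector(model, tokens_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))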
DEAL WITH PHRASES
 Entities as single words:
 during pre-processing: find collocations (e.g. Wikipedia anchor texts),
and feed them as single ‘words’ to the neural network.
 Bigrams (built-in function)
bigram = gensim.models.Phrases(sentence_stream)
term_list = list(bigram[sentence_stream])
model = gensim.models.Word2Vec(term_list)
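A hedged illustration of what Phrases does (toy data; the min_count and
threshold settings usually need tuning on real corpora): frequent word pairs
are merged into single tokens joined by an underscore.

sentence_stream = [["new", "york", "is", "big"],
                   ["i", "love", "new", "york"]] * 50   # repeated so the pair is frequent
bigram = gensim.models.Phrases(sentence_stream, min_count=2, threshold=0.01)
print(bigram[["i", "visited", "new", "york"]])
# e.g. ['i', 'visited', 'new_york'] (exact behaviour depends on the settings)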
HOW TO EVALUATE A MODEL
 There is a comprehensive test set that contains five types of
semantic questions and nine types of syntactic questions:
 8869 semantic questions
 10675 syntactic questions
 https://rare-technologies.com/word2vec-tutorial/
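A hedged sketch of running that test set with gensim (the questions-words.txt
file has to be downloaded separately; newer gensim versions use
evaluate_word_analogies instead of accuracy):

results = model.accuracy("questions-words.txt")
for section in results:
    correct, incorrect = len(section["correct"]), len(section["incorrect"])
    if correct + incorrect > 0:
        print(section["section"], correct / (correct + incorrect))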
FURTHER READING
 T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems, pages 3111–3119,
2013.
 http://radimrehurek.com/2014/02/word2vec-tutorial/
 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
 https://rare-technologies.com/word2vec-tutorial/
PRACTICAL SESSION
PRE-TRAINED WORD2VEC MODELS
 We will compare two pre-trained word2vec models:
 A model trained on a Dutch web corpus (COW)
 A model trained on Dutch web forum data
 Goal: inspect the differences between the models
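A hedged sketch of how the comparison could be set up (the file names are
placeholders, and whether the models are distributed in binary word2vec format
is an assumption):

from gensim.models import KeyedVectors

cow_model = KeyedVectors.load_word2vec_format("cow_model.bin", binary=True)
forum_model = KeyedVectors.load_word2vec_format("forum_model.bin", binary=True)

word = "ziekenhuis"   # Dutch for 'hospital'; pick any word in both vocabularies
print(cow_model.most_similar(word, topn=5))
print(forum_model.most_similar(word, topn=5))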
THANKS
 Traian Rebedea
 Tom Kenter
 Mohammad Mahdavi
 Satoshi Sekine