Vector space word representations
Rani Nelken, PhD 
Director of Research, Outbrain 
@RaniNelken
https://www.flickr.com/photos/hyku/295930906/in/photolist-EbXgJ-ajDBs8-9hevWb-s9HX1-5hZqnb-a1Jk8H-a1Mcx7-7QiUWL-6AFs53-9TRtkz-bqt2GQ-F574u-F56EA-3imqK7/
Words = atoms?
That would be crazy for numbers 
https://www.flickr.com/photos/proimos/4199675334/
The distributional hypothesis 
What is a word? 
Wittgenstein (1953): The meaning of a word is its use in the language
Firth (1957): You shall know a word by the company it keeps
From atomic symbols to vectors
• Map words to dense numerical vectors “representing” their contexts
• Map words with similar contexts to vectors separated by a small angle
History
• Hard clustering: Brown clustering
• Soft clustering: LSA, random projections, LDA
• Neural nets
Feedforward Neural Net Language Model
Training
• Input is the one-hot vectors of the context words (0…0,1,0…0)
• We’re trying to learn a vector for each word (the “projection”)
• Such that the output is close to the one-hot vector of the target word w(t)
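A minimal numpy sketch of this setup (toy sizes and randomly initialized weights, purely illustrative; this shows the shape of the computation, not the exact architecture from the talk):

```python
import numpy as np

# Toy sizes, purely illustrative
vocab_size, dim, n_context = 10, 5, 2

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(vocab_size, dim))              # one "projection" vector per word
W = rng.normal(scale=0.1, size=(dim * n_context, vocab_size))  # output layer

def forward(context_ids):
    # Multiplying a one-hot vector (0...0,1,0...0) by P just selects a row,
    # so the projection step is an embedding lookup.
    h = np.concatenate([P[i] for i in context_ids])
    scores = h @ W
    e = np.exp(scores - scores.max())
    return e / e.sum()  # softmax: training pushes its mass toward the index of w(t)

print(forward([3, 7]).argmax())  # predicted id for the center word w(t)
```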
Simpler model: Word2Vec
What can we do with these representations?
• Plug them into your existing classifier (see the sketch below)
• Plug them into further neural nets – better!
• Improves accuracy on many NLP tasks
  – Named entity recognition
  – POS tagging
  – Sentiment analysis
  – Semantic role labeling
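One simple way to “plug them in”: average the word vectors of a text into a fixed-length feature vector and hand it to an off-the-shelf classifier. A sketch with made-up 2-d vectors standing in for real embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2-d "embeddings"; in practice, load trained word2vec vectors
vecs = {"great": np.array([0.9, 0.1]), "awful": np.array([-0.8, 0.2]),
        "movie": np.array([0.1, 0.7]), "plot": np.array([0.0, 0.6])}

def featurize(text):
    # Average the word vectors: a crude but fixed-length document feature
    words = [w for w in text.lower().split() if w in vecs]
    return np.mean([vecs[w] for w in words], axis=0)

X = np.stack([featurize(t) for t in ["great movie", "awful plot"]])
y = [1, 0]  # toy sentiment labels
clf = LogisticRegression().fit(X, y)
print(clf.predict([featurize("great plot")]))
```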
Back to cheese… 
• cos(crumbled, cheese) = 0.042 
• cos(crumpled, cheese) = 0.203
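For reference, the cosine similarity behind these numbers is just the normalized dot product; `model` here is a hypothetical set of trained embeddings:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product of the two vectors, normalized
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# With trained embeddings (see the gensim examples below), e.g.:
# cosine(model["crumbled"], model["cheese"])
# or equivalently: model.similarity("crumbled", "cheese")
```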
And now for the magic 
http://en.wikipedia.org/wiki/Penn_%26_Teller#mediaviewer/File:Penn_and_Teller_(1988).jpg
“Magical” property
• [Paris] - [France] + [Italy] ≈ [Rome]
• [king] - [man] + [woman] ≈ [queen]
• We can use it to solve word analogy problems: Boston : Red_Sox = New_York : ?
Demo
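In gensim this arithmetic is one call (assuming the pre-trained Google News vectors distributed with word2vec; it is a multi-gigabyte download, and results depend on which embeddings you load):

```python
from gensim.models import KeyedVectors

# Pre-trained embeddings distributed with word2vec
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# [king] - [man] + [woman] ≈ ?
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Boston : Red_Sox = New_York : ?
print(model.most_similar(positive=["Red_Sox", "New_York"],
                         negative=["Boston"], topn=1))
```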
Why does it work?
[king] - [man] + [woman] ≈ [queen]
If all vectors are unit-normalized, cosine is just a dot product, and the dot product distributes over the sum:
cos(x, [king] – [man] + [woman]) ∝ cos(x, [king]) – cos(x, [man]) + cos(x, [woman])
[queen] is a good candidate
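Spelled out (the norm of the combined vector is a constant in x, so dropping it doesn’t change which word wins):

```latex
\operatorname*{arg\,max}_{x}\; \cos\!\bigl(x,\ \mathit{king} - \mathit{man} + \mathit{woman}\bigr)
  = \operatorname*{arg\,max}_{x}\; x \cdot \bigl(\mathit{king} - \mathit{man} + \mathit{woman}\bigr)
  = \operatorname*{arg\,max}_{x}\; \bigl[\cos(x,\mathit{king}) - \cos(x,\mathit{man}) + \cos(x,\mathit{woman})\bigr]
```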
It doesn’t always work
• London : England = Baghdad : ?
• We expect Iraq, but get Mosul
• We’re looking for a word that is close to Baghdad and to England, but not to London
Why did it fail? 
• London : England = Baghdad : ? 
• cos(Mosul, Baghdad) >> cos(Iraq, London) 
• Instead of adding the cosines, multiply them 
• Improves accuracy
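This multiplicative objective (3CosMul, from the Levy & Goldberg paper in the references) is available in gensim as most_similar_cosmul; a sketch reusing the model loaded above:

```python
# Additive scoring: the large cos(x, Baghdad) term dominates the sum
print(model.most_similar(positive=["England", "Baghdad"],
                         negative=["London"], topn=1))

# Multiplicative scoring: balances the three terms, favoring Iraq
print(model.most_similar_cosmul(positive=["England", "Baghdad"],
                                negative=["London"], topn=1))
```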
Word2Vec
• Open-source C implementation from Google
• Comes with pre-trained embeddings
• Gensim: fast Python implementation
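Training your own embeddings with gensim takes a few lines (a toy sketch; in recent gensim releases the dimensionality parameter is vector_size, formerly size):

```python
from gensim.models import Word2Vec

# Toy corpus; in practice, millions of tokenized sentences
sentences = [["the", "cheese", "crumbled", "nicely"],
             ["the", "paper", "crumpled", "loudly"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv.similarity("cheese", "crumbled"))
```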
Active field of research 
• Bilingual embeddings 
• Joint word and image embeddings 
• Embeddings for sentiment 
• Phrase and document embeddings
Bigger picture: how can we make NLP less fragile?
• ’90s: Linguistic engineering
• ’00s: Feature engineering
• ’10s: Unsupervised preprocessing
References
• https://code.google.com/p/word2vec/
• http://www.cs.bgu.ac.il/~yoavg/publications/conll2014analogies.pdf
• http://radimrehurek.com/2014/02/word2vec-tutorial/
Thanks 
@RaniNelken 
We’re hiring for NLP positions
