
Word Embeddings - Introduction

A very short introduction to Word Embeddings.


  1. WORD EMBEDDINGS A non-exhaustive introduction to Word Embeddings. Christian S. Perone, christian.perone@gmail.com
  2. AGENDA INTRODUCTION: Philosophy of Language, Vector Space Model, Embeddings, Word Embeddings, Language Modeling. WORD2VEC: Introduction, Semantic Relations, Other Properties. WORD MOVER'S DISTANCE: Rationale, Model, Results. Q&A.
  3. WHO AM I Christian S. Perone, Machine Learning/Software Engineer. Blog: http://blog.christianperone.com Open-source projects: https://github.com/perone Twitter: @tarantulae
  4. Section I: INTRODUCTION
  5. PHILOSOPHY OF LANGUAGE "(...) the meaning of a word is its use in the language." —Wittgenstein, Ludwig, Philosophical Investigations, 1953
  6-8. VECTOR SPACE MODEL Interpreted lato sensu, a VSM is a space where text is represented as a vector of numbers instead of its original textual representation. There are many approaches to map other spaces into a vector space, and many advantages when you have vectors with special properties.
  9. VECTOR SPACE MODEL (figure slide)
  10. EMBEDDINGS From a space with one dimension per word to a continuous vector space with much lower dimensionality. From one mathematical object to another, but preserving "structure". Source: our beloved scikit-learn.
  11. WORD EMBEDDINGS From a sparse representation (usually a one-hot encoding) to a dense representation: cat = [0, 1, 0, ...] (sparse) vs. V(cat) = [1.4, -1.3, ...] (dense). Embeddings can be created as a by-product of another task or by an explicit embedding model.
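To make the sparse vs. dense contrast concrete, here is a minimal Python sketch (toy vocabulary and random numbers; a real embedding matrix would be learned, not drawn at random):

    # Sparse one-hot vs. dense embedding for the word "cat".
    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat"]

    # Sparse: one dimension per vocabulary word, a single 1 at the word's index.
    one_hot_cat = np.zeros(len(vocab))
    one_hot_cat[vocab.index("cat")] = 1.0       # [0., 1., 0., 0., 0.]

    # Dense: a low-dimensional real-valued vector looked up in an embedding matrix.
    embedding_dim = 4                           # real models use ~100-300 dimensions
    embeddings = np.random.randn(len(vocab), embedding_dim)   # learned in practice
    dense_cat = embeddings[vocab.index("cat")]  # e.g. [ 1.4, -1.3, ... ]

    print(one_hot_cat)
    print(dense_cat)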
  12. LANGUAGE MODELING P(w_1, ..., w_n) = ∏_i P(w_i | w_1, ..., w_{i-1}), so P("the cat sat on the mat") > P("the mat sat on the cat"). Useful for many different tasks, such as speech recognition, handwriting recognition, translation, etc. Naive counting doesn't generalize: there are too many possible sentences, and "a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training" [Bengio et al., 2003]. Hence the Markov assumption and ways to approximate the model.
  13. MARKOV ASSUMPTION AND N-GRAM MODELS The Markov assumption simplifies the model by approximating the components of the product. Unigram: P(w_1, ..., w_n) ≈ ∏_i P(w_i). Bigram: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-1}). Extend to trigram, 4-gram, etc.
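A minimal count-based bigram model sketch on a toy corpus, illustrating the Markov approximation above (real models need a large corpus and smoothing for unseen bigrams):

    # Maximum-likelihood bigram language model from raw counts.
    from collections import Counter

    corpus = ["the cat sat on the mat", "the dog sat on the rug"]
    tokens = [s.split() for s in corpus]

    unigrams = Counter(w for sent in tokens for w in sent)
    bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))

    def p_bigram(word, prev):
        # P(word | prev) = count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    def p_sentence(sentence):
        words = sentence.split()
        p = 1.0
        for prev, word in zip(words, words[1:]):
            p *= p_bigram(word, prev)
        return p

    print(p_sentence("the cat sat on the mat"))  # > 0
    print(p_sentence("the mat sat on the cat"))  # 0.0 here: an unseen bigram
                                                 # zeroes the product (smoothing needed)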
  14. WORD EMBEDDINGS CHARACTERISTICS Learned via language modeling; low-dimensional and dense, but with increased model complexity. Examples: neural language models, word2vec, GloVe, etc. The classic neural language model was proposed by Bengio et al. in 2003; after that came many other important works, by Collobert and Weston (2008) and then by Mikolov et al. (2013). Source: Bengio et al., 2003.
  15. Section II: WORD2VEC
  16. WORD2VEC An unsupervised technique (trained through supervised tasks) that takes a text corpus as input and produces word embeddings as output. Two different architectures: CBOW and Skip-gram. In the CBOW model, the distributed representations of the context (surrounding words) are combined to predict the word in the middle; in the Skip-gram model, the distributed representation of the input word is used to predict the context. Source: Exploiting Similarities among Languages for Machine Translation. Mikolov, Tomas et al., 2013.
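A minimal sketch of training word2vec with gensim; it assumes gensim >= 4.0 (where the dimensionality argument is called vector_size; older versions use size) and uses a toy corpus in place of a real one:

    # Train a small word2vec model with gensim.
    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]  # in practice: a large tokenized corpus

    model = Word2Vec(
        sentences,
        vector_size=100,  # dimensionality of the embeddings
        window=5,         # context window size
        min_count=1,      # keep rare words for this toy example
        sg=1,             # 1 = Skip-gram, 0 = CBOW
    )

    print(model.wv["cat"])               # the dense vector for "cat"
    print(model.wv.most_similar("cat"))  # nearest words by cosine similarity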
  17. WORD2VEC (figure slide) Source: TensorFlow.
  18. AMAZING EMBEDDINGS Semantic relationships are often preserved under vector operations. Source: TensorFlow.
  19-21. WORD ANALOGIES Suppose we have a vector w ∈ R^n for any given word, such as w_king. Then we can compute: w_king − w_man + w_woman ≈ w_queen. This vector operation shows that the closest word vector to the resulting vector is w_queen. This is an amazing property of word embeddings, because it means they carry relational information that can be used in many different tasks.
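A sketch of the same analogy with gensim's KeyedVectors; it assumes a pretrained word2vec-format file is available locally (the Google News file name below is the commonly distributed one, and with those vectors the query typically returns "queen"):

    # king - man + woman with pretrained vectors.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True
    )

    # most_similar computes the vector king - man + woman and returns the
    # nearest word vectors by cosine similarity.
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))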
  22. LANGUAGE STRUCTURE Figure: distributed word vector representations of numbers (one-five / uno-cinco) and animals (cat, dog, cow, horse, pig / gato, perro, vaca, caballo, cerdo) in English (left) and Spanish (right). The five vectors in each language were projected down to two dimensions using PCA and then manually rotated to accentuate their similarity; these concepts have similar geometric arrangements in both spaces. Source: Exploiting Similarities among Languages for Machine Translation. Mikolov, Tomas et al., 2013.
  23-25. DEEP LEARNING? Word2vec isn't Deep Learning; the model is actually very shallow. However, there is an important relation here, because word embeddings are often used to initialize the embedding layers of deep architectures (e.g. LSTMs) for different tasks. You can also, of course, train word2vec models using techniques developed in the Deep Learning context.
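A sketch (PyTorch, hypothetical toy vocabulary) of initializing the embedding layer of a deep model such as an LSTM from pretrained vectors; in practice the rows of the pretrained matrix would be copied from a word2vec model rather than drawn at random:

    # Initialize an embedding layer from pretrained word vectors and feed an LSTM.
    import torch
    import torch.nn as nn

    vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary
    pretrained = torch.randn(len(vocab), 300)    # stand-in for word2vec rows

    embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tune later
    lstm = nn.LSTM(input_size=300, hidden_size=128, batch_first=True)

    token_ids = torch.tensor([[vocab.index(w) for w in ["the", "cat", "sat"]]])
    output, _ = lstm(embedding(token_ids))       # shape (1, 3, 128)
    print(output.shape)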
  26. DEMO Demo time for some word analogies in Portuguese, using a model trained by Kyubyong Park on the Portuguese Wikipedia (a 1.3 GB corpus; for comparison, the English Wikipedia is 13.5 GB). Vectors w ∈ R^300, vocabulary size 50,246. Model available at https://github.com/Kyubyong/wordvectors
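A sketch of loading the demo model with gensim; the file name inside the download (pt.bin here) and the fact that it loads as a gensim-saved Word2Vec model are assumptions about that distribution:

    # Load the Portuguese vectors and run an analogy query.
    from gensim.models import Word2Vec

    model = Word2Vec.load("pt/pt.bin")           # hypothetical path inside the download
    print(model.wv.vectors.shape)                # expected roughly (50246, 300)

    # rei - homem + mulher, assuming these words are in the vocabulary.
    print(model.wv.most_similar(positive=["rei", "mulher"], negative=["homem"], topn=3))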
  27. Section III: WORD MOVER'S DISTANCE
  28-29. WORD MOVER'S DISTANCE Word2vec defines a vector for each word, but how can we use that information to compare documents? There are many approaches to representing documents, among them BOW, TF-IDF, n-grams, etc. However, because these representations are sparse and high-dimensional, document vectors frequently end up nearly orthogonal even when the documents are related.
  30-31. WORD MOVER'S DISTANCE Take the two sentences: "Obama speaks to the media in Illinois" and "The President greets the press in Chicago". While these sentences have no words in common, they convey nearly the same information, a fact that cannot be represented by the BOW model (Kusner, Matt J. et al., 2015).
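A short sketch showing the BOW failure mode on exactly these two sentences: after English stop words are removed they share no terms, so their BOW vectors are orthogonal and the cosine similarity is zero despite the near-identical meaning:

    # BOW vectors of semantically similar sentences can be orthogonal.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "Obama speaks to the media in Illinois",
        "The President greets the press in Chicago",
    ]
    bow = CountVectorizer(stop_words="english").fit_transform(docs)
    print(cosine_similarity(bow[0], bow[1]))   # [[0.]]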
  32. WORD MOVER'S DISTANCE Figure 1 (from the paper): an illustration of the Word Mover's Distance. All non-stop words of both documents ("Obama speaks to the media in Illinois" / "The President greets the press in Chicago") are embedded into the word2vec space and matched across documents (Obama → President, speaks → greets, media → press, Illinois → Chicago). Source: From Word Embeddings To Document Distances. Kusner, Matt J. et al., 2015.
  33. WORD MOVER'S DISTANCE Figure 2 (from the paper): (Top) the components of the WMD metric between a query D0 and two sentences D1, D2 with equal BOW distance; the arrows represent flow between two words and are labeled with their distance contribution. (Bottom) the flow between two sentences D3 and D0 with different numbers of words. The sentences in the figure are "Obama speaks to the media in Illinois.", "The President greets the press in Chicago.", "The band gave a concert in Japan.", and "Obama speaks in Illinois." Source: From Word Embeddings To Document Distances. Kusner, Matt J. et al., 2015.
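A sketch of computing the Word Mover's Distance between the sentences from the figure with gensim's wmdistance; it assumes pretrained embeddings are available and that the optimal-transport backend gensim relies on (POT or pyemd, depending on the version) is installed. Stop words are dropped by hand here:

    # WMD between documents using pretrained word2vec embeddings.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True
    )

    d0 = "Obama speaks media Illinois".split()      # stop words removed by hand
    d1 = "President greets press Chicago".split()
    d2 = "band gave concert Japan".split()

    print(wv.wmdistance(d0, d1))  # small distance: same meaning, no shared words
    print(wv.wmdistance(d0, d2))  # larger distance: unrelated sentence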
  34. WORD MOVER'S DISTANCE Figure 3 (from the paper): kNN test error results on 8 document classification data sets (bbcsport, twitter, recipe, ohsumed, classic, reuters, amazon, 20news), compared to canonical and state-of-the-art baseline methods: BOW [Frakes & Baeza-Yates, 1992], TF-IDF [Jones, 1972], Okapi BM25 [Robertson & Walker, 1994], LSI [Deerwester et al., 1990], LDA [Blei et al., 2003], mSDA [Chen et al., 2012], Componential Counting Grid [Perina et al., 2013], and the Word Mover's Distance. Figure 4 (from the paper): kNN test errors of the same metrics averaged over all eight datasets, relative to kNN with BOW (bar values: 1.29, 1.15, 1.0, 0.72, 0.60, 0.55, 0.49, 0.42). Table 2 (from the paper): test error percentage for different text embeddings, where NIPS, AMZ, and News are word2vec (w2v) models trained on different data sets and HLBL and CW were obtained with other embedding algorithms:
        DATASET    HLBL   CW    NIPS   AMZ   NEWS
        BBCSPORT    4.5    8.2   9.5    4.1    5.0
        TWITTER    33.3   33.7  29.3   28.1   28.3
        RECIPE     47.0   51.6  52.7   47.4   45.1
        OHSUMED    52.0   56.2  55.6   50.4   44.5
        CLASSIC     5.3    5.5   4.0    3.8    3.0
        REUTERS     4.2    4.6   7.1    9.1    3.5
        AMAZON     12.3   13.3  13.9    7.8    7.2
  Source: From Word Embeddings To Document Distances. Kusner, Matt J. et al., 2015.
  35. Section IV: Q&A
  36. Q&A
