Word Embeddings (D2L4 Deep Learning for Speech and Language UPC 2017)


https://telecombcn-dl.github.io/2017-dlsl/

Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.

The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state-of-the-art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.



  1. [course site] Day 2 Lecture 4: Word Embeddings. Word2Vec. Antonio Bonafonte
  2. Christopher Olah, Visualizing Representations
  3. Representation of categorical features. A categorical variable can take a limited number of possible values, e.g. gender, blood type, country, ..., letters, words, phonemes. Newsgroup task: input is 1000 words (from a fixed vocabulary V, of size |V|). Phonetic transcription (CMU dictionary): input is letters (a b c ... z ' - .), 30 symbols.
  4. One-hot (one-of-n) encoding. Example: letters, |V| = 30. 'a': x^T = [1,0,0,...,0]; 'b': x^T = [0,1,0,...,0]; 'c': x^T = [0,0,1,...,0]; ...; '.': x^T = [0,0,0,...,1]
  5. One-hot (one-of-n) encoding. Example: words. cat: x^T = [1,0,0,...,0]; dog: x^T = [0,1,0,...,0]; ...; mamaguy: x^T = [0,...,0,1,0,...,0]; ... How many words, |V|? B2 vocabulary: 5K; C2 vocabulary: 18K; LVCSR: 50-100K; Wikipedia (1.6B tokens): 400K; crawl data (42B tokens): 2M
  6. One-hot encoding of words: limitations ● Large dimensionality ● Sparse representation (mostly zeros) ● Blind representation: the only operators available are '==' and '!=', so there is no notion of similarity between words
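A minimal sketch of one-hot encoding in Python/NumPy (the vocabulary below is the toy corpus of slide 14; names such as one_hot and word2idx are just illustrative):

    import numpy as np

    # Toy vocabulary from the "predict next word" example (slide 14).
    vocab = ["a", "cat", "chased", "climbed", "dog", "saw", "the", "tree"]
    word2idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """Return the one-hot vector x for a word (dimension |V|)."""
        x = np.zeros(len(vocab))
        x[word2idx[word]] = 1.0
        return x

    print(one_hot("cat"))   # [0. 1. 0. 0. 0. 0. 0. 0.]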
  7. Word embeddings ● Represent words using vectors of dimension d (~100-500) ● Meaningful (semantic, syntactic) distances ● Dominant research topic in recent years in NLP conferences (e.g. EMNLP) ● Good embeddings are useful for many other tasks
  8. GloVe (Stanford)
  9. GloVe (Stanford)
  10. Evaluation of the representation. Word analogy: a is to b as c is to .... Find d such that w_d is closest to w_b - w_a + w_c. ● Athens is to Greece as Berlin is to .... ● Dance is to dancing as fly is to .... Word similarity: closest word to a given word.
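A minimal sketch of the analogy test, assuming an embedding matrix E of shape |V| x d and word2idx / idx2word mappings; these names and the use of cosine similarity are illustrative, not the exact evaluation script behind the slide:

    import numpy as np

    def analogy(a, b, c, E, word2idx, idx2word):
        """Return d such that w_d is closest (cosine) to w_b - w_a + w_c."""
        q = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
        sims = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-9)
        for w in (a, b, c):                     # exclude the query words themselves
            sims[word2idx[w]] = -np.inf
        return idx2word[int(np.argmax(sims))]

    # analogy("athens", "greece", "berlin", E, word2idx, idx2word) should ideally give "germany".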
  11. How to define word representations? "You shall know a word by the company it keeps." (Firth, J. R., 1957). Some relevant examples. Latent Semantic Analysis (LSA): ● Define the co-occurrence matrix of words w_i in documents d_j ● Apply SVD to reduce dimensionality. GloVe (Global Vectors): ● Start with the co-occurrence counts of word w_j in the context of word w_i ● Fit a log-bilinear regression model of the embeddings to these counts
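A minimal LSA-style sketch, assuming a word-by-document count matrix X of shape |V| x num_docs (in practice the counts are usually reweighted, e.g. with TF-IDF, which is omitted here):

    import numpy as np

    def lsa_embeddings(X, d=100):
        """Truncated SVD of the co-occurrence matrix; rows of the result are d-dim word vectors."""
        U, S, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :d] * S[:d]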
  12. Using a NN to define embeddings. One-hot encoding + fully connected layer → embedding (projection) layer. Example: a language model (predict the next word given the previous words) produces word embeddings (Bengio 2003).
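To see why this fully connected layer is just a lookup table, here is a small check (the dimensions follow the toy example of slides 14-16, where WI is d x |V|, so WI · x selects one column per one-hot input; the variable names are illustrative):

    import numpy as np

    V, d = 8, 3
    WI = np.random.rand(d, V) - 0.5       # projection weights: one column per word
    x = np.zeros(V)
    x[1] = 1.0                            # one-hot vector for word index 1 ("cat")

    assert np.allclose(WI @ x, WI[:, 1])  # the matrix-vector product is just column 1 of WI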
  13. From the TensorFlow word2vec tutorial: https://www.tensorflow.org/tutorials/word2vec/
  14. Toy example: predict the next word. Corpus: "the dog saw a cat", "the dog chased a cat", "the cat climbed a tree". |V| = 8. One-hot encoding: a: [1,0,0,0,0,0,0,0]; cat: [0,1,0,0,0,0,0,0]; chased: [0,0,1,0,0,0,0,0]; climbed: [0,0,0,1,0,0,0,0]; dog: [0,0,0,0,1,0,0,0]; saw: [0,0,0,0,0,1,0,0]; the: [0,0,0,0,0,0,1,0]; tree: [0,0,0,0,0,0,0,1]
  15. Toy example: predict the next word. Architecture: Input (projection) layer: h1(x) = WI · x. Hidden layer: h2(x) = g(WH · h1(x)). Output layer: z(x) = softmax(WO · h2(x)). Training sample: cat → climbed
  16. (Octave) x = zeros(8,1); y = zeros(8,1); x(2) = 1; y(4) = 1; % one-hot input (cat) and target (climbed) WI = rand(3,8) - 0.5; WO = rand(8,3) - 0.5; WH = rand(3,3); h1 = WI * x; a2 = WH * h1; h2 = tanh(a2); a3 = WO * h2; z3 = exp(a3); z3 = z3 / sum(z3); % softmax output
  17. Input layer: h1(x) = WI · x (projection layer). Note that a non-linear activation function in the projection layer is irrelevant: since x is one-hot, h1(x) is simply one column of WI, and an element-wise non-linearity would only reparametrize that embedding.
  18. Hidden layer: h2(x) = g(WH · h1(x)). Softmax layer: z(x) = softmax(WO · h2(x))
  19. Computational complexity. Example: number of training samples: 1B; vocabulary size: 100K; context: 3 previous words; embedding dimension: 100; hidden layer: 300 units. Projection layer: no products (copy 3 rows of WI). Hidden layer: 300 · 300 = 90K products, 300 tanh(·). Softmax layer: 300 · 100K = 30M products, 100K exp(·). Total per training sample: about 90K + 30M products. The softmax is the network's main bottleneck.
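The per-sample operation counts behind these numbers, as a small Python check (using the figures assumed on the slide):

    context, emb, hidden, vocab = 3, 100, 300, 100_000

    projection_products = 0                        # just copy 3 rows of WI
    hidden_products = (context * emb) * hidden     # 300 * 300 = 90,000
    softmax_products = hidden * vocab              # 300 * 100K = 30,000,000

    print(hidden_products, softmax_products)       # 90000 30000000 -> the softmax dominates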
  20. Word embeddings: requirements. We can get embeddings implicitly from any task that involves words. However, good generic embeddings are useful for other tasks which may have much less data (transfer learning). Sometimes the embeddings can then be fine-tuned to the final task.
  21. Word embeddings: requirements. How to get good embeddings: ● Very large lexicon ● Huge amount of training data ● Unsupervised (or with trivial labels) ● Computationally efficient
  22. Word2Vec [Mikolov 2013]. An architecture designed specifically for producing embeddings, learnt from a huge amount of data. Simplify the architecture: remove the hidden layer. Simplify the cost (the softmax). Two variants: ● CBOW (continuous bag of words) ● Skip-gram
  23. CBOW: Continuous Bag of Words. Example: "the cat climbed a tree". Given the context (a, cat, the, tree), estimate the probability of the center word, climbed.
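A minimal CBOW forward pass, assuming input embeddings WI and output embeddings WO, both of shape |V| x d (a full softmax is used here for clarity, although slide 26 replaces it with negative sampling):

    import numpy as np

    def cbow_probs(context_ids, WI, WO):
        """P(center word | context): average the context embeddings, then softmax."""
        h = WI[context_ids].mean(axis=0)      # d-dim hidden vector (no hidden layer, no non-linearity)
        scores = WO @ h                       # one score per vocabulary word
        e = np.exp(scores - scores.max())     # numerically stable softmax
        return e / e.sum()

    # Example with the toy vocabulary: context (a, cat, the, tree) -> distribution over 8 words.
    # probs = cbow_probs([word2idx[w] for w in ("a", "cat", "the", "tree")], WI, WO)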
  24. Skip-gram. Example: "the cat climbed a tree". Given the word climbed, estimate the probabilities of the context words (a, cat, the, tree). (The context length is sampled at random for each position, up to a maximum of 10 words to the left and 10 to the right.)
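A sketch of skip-gram training-pair generation with the randomly shrunk window described above (max_window is 10 in word2vec; 2 here so the toy sentence gives a readable output):

    import random

    def skipgram_pairs(tokens, max_window=2):
        """Yield (center word, context word) pairs with a window size sampled per position."""
        for i, center in enumerate(tokens):
            window = random.randint(1, max_window)            # randomly shrink the context
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    yield center, tokens[j]

    print(list(skipgram_pairs("the cat climbed a tree".split())))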
  25. Reduce cost: subsampling. The most frequent words (is, the) can appear hundreds of millions of times, and their co-occurrences are less informative: Paris ~ France is useful, Paris ~ the is not. → Each input word w_i is discarded with probability P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) is the frequency of the word and t is a threshold (around 10^-5 in [Mikolov 2013]).
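A sketch of the subsampling rule (the discard probability above, from the cited Word2Vec paper; t is the frequency threshold, typically around 1e-5):

    import math, random

    def keep_word(freq, t=1e-5):
        """freq is the relative frequency f(w_i) of the word in the corpus."""
        p_discard = max(0.0, 1.0 - math.sqrt(t / freq))
        return random.random() >= p_discard

    # A very frequent word such as "the" (freq ~ 0.05) is kept only ~1.4% of the time;
    # rare words (freq <= t) are always kept.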
  26. Simplify cost: negative sampling. The softmax is required to get proper probabilities, but our goal here is just to get good embeddings. Cost function: maximize σ(WO_j · WI_i) for each observed (input word i, context word j) pair, and add negative terms for a few randomly selected words that do not appear in the context.
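A minimal sketch of the negative-sampling objective for a single (input word i, context word j) pair, assuming input embeddings WI and output embeddings WO of shape |V| x d, and k negative word indices drawn at random (word2vec draws them from a smoothed unigram distribution, which is omitted here):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_loss(i, j, neg_ids, WI, WO):
        """-log s(WO_j . WI_i) - sum_k log s(-WO_k . WI_i), to be minimized."""
        v_in = WI[i]
        pos = np.log(sigmoid(WO[j] @ v_in))
        neg = np.sum(np.log(sigmoid(-WO[neg_ids] @ v_in)))
        return -(pos + neg)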
  27. [figure-only slide]
  28. Discussion. Word2Vec is not deep, but it is used in many tasks that use deep learning. There are other approaches: word2vec is a very popular toolkit, with pre-trained embeddings available, but there are others (e.g. GloVe). Why does it work? See the GloVe paper [Pennington et al.].
  29. References. Word2Vec: ● Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space." ● Mikolov, Tomas, et al. "Linguistic Regularities in Continuous Space Word Representations." ● TensorFlow tutorial: https://www.tensorflow.org/tutorials/word2vec/ GloVe: http://nlp.stanford.edu/projects/glove/ (and the paper by Pennington et al.) Blog: Sebastian Ruder, sebastianruder.com/word-embeddings-1/
