Deep Learning for Natural Language Processing
Viet-Trung Tran
Some of the challenges in Language Understanding
• Language is ambiguous:
– Every sentence has many possible interpretations.
• Language is productive:
– We will always encounter new words or new constructions.
• Language is culturally specific.
Example of ambiguity: "fruit flies like a banana" admits several possible POS taggings:
NN NN VB DT NN
NN VB P  DT NN
NN NN P  DT NN
NN VB VB DT NN
ML: Traditional Approach
• For each new problem/question
– Gather as much LABELED data as you can get
– Throw some algorithms at it (mainly put in an SVM and keep it at that)
– If you actually have tried more algos: pick the best
– Spend hours hand-engineering some features / feature selection / dimensionality reduction (PCA, SVD, etc.)
– Repeat…
Deep learning vs the rest
Deep Learning: Why for NLP?
• Beats the state of the art in:
– Language Modeling (Mikolov et al. 2011) [WSJ AR task]
– Speech Recognition (Dahl et al. 2012, Seide et al. 2011; following Mohamed et al. 2011)
– Sentiment Classification (Socher et al. 2011)
– MNIST handwritten digit recognition (Ciresan et al. 2010)
– Image Recognition (Krizhevsky et al. 2012) [ImageNet]
Language semantics
• What is the meaning of a word? (Lexical semantics)
• What is the meaning of a sentence? ([Compositional] semantics)
• What is the meaning of a longer piece of text? (Discourse semantics)
One-hot encoding
• Form a vocabulary of words that maps lemmatized words to a unique ID (the position of the word in the vocabulary)
• Typical vocabulary sizes vary between 10,000 and 250,000
One-hot encoding
• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID
– For vocabulary size D=10, the one-hot vector of word ID w=4 is e(w) = [ 0 0 0 1 0 0 0 0 0 0 ]
• A one-hot encoding makes no assumption about word similarity
• All words are equally different from each other
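A minimal sketch of building such an encoding in Python (the helper names and the toy sentence are illustrative, not from the slides; 0-based positions are used):

```python
import numpy as np

def build_vocab(tokens):
    """Map each distinct (lemmatized) word to a unique integer ID."""
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}

def one_hot(word_id, vocab_size):
    """All zeros except a 1 at the position associated with the word ID."""
    e = np.zeros(vocab_size)
    e[word_id] = 1.0
    return e

vocab = build_vocab("the cat sat on the mat".split())
print(one_hot(vocab["cat"], len(vocab)))   # [1. 0. 0. 0. 0.]
```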
Word representation
• Standard
– Bag of Words
– A one-hot encoding
– 20k to 50k dimensions
– Can be improved by factoring in document frequency
• Word embedding
– Neural word embeddings
– Uses a vector space that attempts to predict a word given a context window
– 200-400 dimensions
Word embeddings make semantic similarity and synonyms possible.
Distributional representations
• "You shall know a word by the company it keeps" (J. R. Firth, 1957)
• One of the most successful ideas of modern statistical NLP!
• Word embeddings (Bengio et al. 2001; Bengio et al. 2003), based on the idea of distributed representations for symbols (Hinton 1986)
• Neural word embeddings (Mnih and Hinton 2007; Collobert & Weston 2008; Turian et al. 2010; Collobert et al. 2011; Mikolov et al. 2011)
Neural distributional representations
• Neural word embeddings
• Combine vector space semantics with the prediction of probabilistic models
• Words are represented as a dense vector
• Example (figure): "Human" = a dense vector of real values
Vector space model
Word embeddings
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning.
• What words have embeddings closest to a given word? (From Collobert et al. 2011)
Word Embeddings for MT: Mikolov (2013)
Word Embeddings
• One of the most exciting areas of research in deep learning
• Introduced by Bengio et al. (2003)
• W: words → R^n is a parameterized function mapping words in some language to high-dimensional vectors (200 to 500 dimensions)
– W("cat") = (0.2, -0.4, 0.7, ...)
– W("mat") = (0.0, 0.6, -0.1, ...)
• Typically, the function is a lookup table, parameterized by a matrix θ with a row for each word: Wθ(wn) = θn
• W is initialized with random vectors for each word
• The word embedding learns meaningful vectors by being trained to perform some task
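A minimal sketch of such a lookup-table embedding in NumPy (the vocabulary, `embedding_dim`, and function names are illustrative assumptions):

```python
import numpy as np

vocab = {"cat": 0, "mat": 1, "sat": 2, "on": 3, "the": 4}
embedding_dim = 200                 # typically 200-500 per the slide

# theta: one random row per word; training adjusts these rows
theta = np.random.randn(len(vocab), embedding_dim) * 0.01

def W(word):
    """Lookup-table embedding: W_theta(w_n) = theta_n."""
    return theta[vocab[word]]

print(W("cat").shape)               # (200,)
```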


Learning word vectors (Collobert et al., JMLR 2011)
• Idea: a word and its context is a positive training example; a random word in the same context gives a negative training example
Example
• Train a network to predict whether a 5-gram (sequence of five words) is "valid"
• Source: any text corpus (e.g., Wikipedia)
• Corrupt half of the 5-grams to create negative training examples
– Swap in a word to make the 5-gram nonsensical
– "cat sat song the mat"


Neural network to determine if a 5-gram is 'valid' (Bottou 2011)
• Look up each word in the 5-gram through W
• Feed those vectors into the network R
• R tries to predict whether the 5-gram is 'valid' or 'invalid'
– R(W("cat"), W("sat"), W("on"), W("the"), W("mat")) = 1
– R(W("cat"), W("sat"), W("song"), W("the"), W("mat")) = 0
• The network needs to learn good parameters for both W and R
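A minimal sketch of one way R could look (a single tanh hidden layer over the concatenated word vectors with a sigmoid output; the layer sizes are assumptions and the training loop is omitted):

```python
import numpy as np

embedding_dim, hidden = 200, 100
W1 = np.random.randn(hidden, 5 * embedding_dim) * 0.01   # input -> hidden weights
b1 = np.zeros(hidden)
W2 = np.random.randn(hidden) * 0.01                      # hidden -> output weights
b2 = 0.0

def R(word_vectors):
    """Score a 5-gram: concatenate its five word vectors, pass them through
    one tanh hidden layer, and squash to a validity score in (0, 1)."""
    x = np.concatenate(word_vectors)                 # shape: (5 * embedding_dim,)
    h = np.tanh(W1 @ x + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))      # ~1 for 'valid', ~0 for 'invalid'

score = R([np.random.randn(embedding_dim) for _ in range(5)])
```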
Idea
• "a few people sing well" → "a couple people sing well"
• The validity of the sentence doesn't change
• So if W maps synonyms (like "few" and "couple") close together,
– little changes from R's perspective
Bingo
• The number of possible 5-grams is massive
• But there is only a small number of data points to learn from
• The network generalizes across similar classes of words
– "the wall is blue" → "the wall is red"
• And across multiple words at once
– "the wall is blue" → "the ceiling is red"
• Shifting "red" closer to "blue" makes the network R perform better
Word embedding property
• Analogies between words are encoded in the difference vectors between words
– W("woman") − W("man") ≃ W("aunt") − W("uncle")
– W("woman") − W("man") ≃ W("queen") − W("king")
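A small sketch of checking such a regularity with vector arithmetic (it reuses the hypothetical lookup `W` from the earlier sketch; `most_similar` is an illustrative helper, not a library call):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(query_vec, W, vocab, exclude=()):
    """Return the vocabulary word whose embedding is closest to query_vec."""
    candidates = [w for w in vocab if w not in exclude]
    return max(candidates, key=lambda w: cosine(query_vec, W(w)))

# If the analogy holds, this retrieves something like "queen":
# query = W("king") + (W("woman") - W("man"))
# print(most_similar(query, W, vocab, exclude={"king", "woman", "man"}))
```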
Linguistic Regularities: Mikolov (2013)
Word embedding property: Shared representations
• "The use of word representations… has become a key 'secret sauce' for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling." (Luong et al. 2013)
• W and F learn to perform task A. Later, G can learn to perform task B based on W.
Bilingual word-embedding
English – Chinese word mapping
Embed images and words in a single representation
Feedforward neural net language model (NNLM), Bengio et al., 2003
• Long training time
Recurrent neural network based language model (Mikolov et al., 2010)
• Elman network
Simple RNN training
• Input vector: 1-of-N encoding (one-hot)
• Repeated epochs
– s(0): vector of small values (0.1)
– Hidden layer: 30-500 units
– All training data from the corpus are presented sequentially
– Initial learning rate: 0.1
– Error function: cross-entropy (standard for RNN language models)
– Standard backpropagation with stochastic gradient descent
• Convergence achieved after 10-20 epochs
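A minimal sketch of one step of such an Elman-style RNN language model (dimensions and initialization are illustrative assumptions; the softmax output gives a distribution over the next word):

```python
import numpy as np

vocab_size, hidden_size = 10_000, 100
U = np.random.randn(hidden_size, vocab_size) * 0.1   # input -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden (recurrence)
V = np.random.randn(vocab_size, hidden_size) * 0.1   # hidden -> output

def rnn_step(word_id, s_prev):
    """One Elman step: combine the current one-hot word with the previous
    state, then predict a probability distribution over the next word."""
    x = np.zeros(vocab_size); x[word_id] = 1.0
    s = 1.0 / (1.0 + np.exp(-(U @ x + W @ s_prev)))   # sigmoid hidden state
    logits = V @ s
    y = np.exp(logits - logits.max()); y /= y.sum()   # softmax over vocabulary
    return s, y

s = np.full(hidden_size, 0.1)          # s(0): vector of small values
s, next_word_probs = rnn_step(42, s)
```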
Word2vec (Mikolov et al., 2013)
• Log-linear model
• Previous models use a non-linear hidden layer, which drives up the complexity
• Continuous word vectors are learned with a simpler model
Continuous BoW (CBOW) Model
• Similar to the feed-forward NNLM, but
– the non-linear hidden layer is removed
• Called CBOW (continuous BoW) because the order of the words is lost; see the sketch below
CBOW Model (architecture figure)
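A minimal sketch of the CBOW idea in NumPy (average the context embeddings, then score every vocabulary word as the center word; the vocabulary, dimensions, and names are illustrative):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
dim = 100
W_in = np.random.randn(len(vocab), dim) * 0.01    # input (context) embeddings
W_out = np.random.randn(len(vocab), dim) * 0.01   # output (center-word) weights

def cbow_predict(context_words):
    """Average the context embeddings (word order is lost) and return a
    probability distribution over the center word."""
    h = np.mean([W_in[vocab[w]] for w in context_words], axis=0)
    logits = W_out @ h
    p = np.exp(logits - logits.max()); p /= p.sum()
    return p

p = cbow_predict(["the", "cat", "on", "the"])     # context around "sat"
print(p.argmax())                                 # index of the predicted center word
```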
Continuous Skip-gram Model
• Similar to CBOW, but
– tries to maximize classification of a word based on another word in the same sentence
• Predicts words within a certain window around the current word
• Observations
– A larger window size gives better quality of the resulting word vectors, but higher training time
– More distant words are usually less related to the current word than those close to it
– Give less weight to distant words by sampling them less often in the training examples (see the sketch below)
Continuous Skip-gram Model (architecture figure)
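A rough sketch of generating skip-gram training pairs with the distance-based down-weighting described above: the effective window is re-drawn uniformly for each center word, so distant context words are included less often. This mirrors the window-shrinking trick used in word2vec, but the code and names here are illustrative:

```python
import random

def skipgram_pairs(tokens, max_window=5):
    """Yield (center, context) pairs; distant words are sampled less often
    because the effective window size is drawn anew for each center word."""
    for i, center in enumerate(tokens):
        window = random.randint(1, max_window)     # shrink the window at random
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(sentence, max_window=3):
    print(center, "->", context)
```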
RECURSIVE NEURAL NETWORKS
Modular network that learns word embeddings
• Fixed number of inputs
Recursive neural networks
• The output of a module goes into a module of the same type
• Tree-structured neural networks
• No fixed number of inputs
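A minimal sketch of the recursive composition step (two child vectors are combined into a parent vector of the same size with shared weights, so the module can be reapplied up the tree; the weight names and the toy tree are illustrative):

```python
import numpy as np

dim = 50
W = np.random.randn(dim, 2 * dim) * 0.01   # shared composition weights
b = np.zeros(dim)

def compose(left, right):
    """Combine two child representations into a same-sized parent, so the
    module can be applied again higher up the tree."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# ((the cat) sat) -- a toy binary tree over word vectors
the, cat, sat = (np.random.randn(dim) for _ in range(3))
np_phrase = compose(the, cat)           # "the cat"
sentence_vec = compose(np_phrase, sat)  # "((the cat) sat)"
print(sentence_vec.shape)               # (50,)
```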
Building on Word Vector Space Models
• But how can we represent the meaning of longer phrases?
• By mapping them into the same vector space!
How should we map phrases into a vector space?
Sentence Parsing: What we want
Learn Structure and Representation
Recursive Neural Networks for Structure Prediction
Recursive Neural Network Definition
Recursive Application of Relational Operators
Parsing a sentence with an RNN
(Figure sequence: parsing a sentence, step by step)
Labeling in Recursive Neural Networks
Recursive matrix-vector model
Recursive neural tensor network 

Socher et al. 2013: Sentence sentiment analysis
Neural tensor network
Reversible sentence representation (Bottou 2011)
• Bilingual sentence representation
Cho et al. (2014)
Credits
• Richard Socher, Christopher Manning
– Stanford University
– nlp.stanford.edu/courses/NAACL2013/
• Roelof Pieters, PhD candidate, KTH/CSC
• http://colah.github.io/
• Bengio, GSS 2012
Language Modeling
• A language model is a probabilistic model that assigns a probability to any sequence of words: p(w1, ..., wT)
• Language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences
• Plays a crucial role in speech recognition and machine translation systems
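For reference, the sequence probability factorizes by the chain rule, and an n-gram model (next slide) approximates each conditional with a limited history; this is the standard formulation rather than a formula taken verbatim from the slides:

```latex
p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})
\approx \prod_{t=1}^{T} p(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```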
N-gram models
• An n-gram is a sequence of n words
– unigrams (n=1): "is", "a", "sequence", etc.
– bigrams (n=2): ["is", "a"], ["a", "sequence"], etc.
– trigrams (n=3): ["is", "a", "sequence"], ["a", "sequence", "of"], etc.
• n-gram models estimate the conditional probability of the next word from n-gram counts; see the sketch below
• The counts are obtained from a training corpus (a dataset of text)
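A small sketch of estimating a bigram model from counts (a maximum-likelihood estimate with no smoothing; the tiny corpus is illustrative):

```python
from collections import Counter

corpus = "is a sequence of words is a sequence of tokens".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(next_word, prev_word):
    """MLE estimate p(next | prev) = count(prev, next) / count(prev)."""
    return bigram_counts[(prev_word, next_word)] / unigram_counts[prev_word]

print(p_bigram("a", "is"))        # 1.0 -- "is" is always followed by "a"
print(p_bigram("words", "of"))    # 0.5 -- "of" is followed by "words" half the time
```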
