Deep Learning for Natural Language Processing
Viet-Trung Tran
Some of the challenges in Language Understanding
• Language is ambiguous:
– Every sentence has many possible interpretations.
• Language is productive:
– We will always encounter new words or new constructions.
• Language is culturally specific.
Example of ambiguity: "fruit flies like a banana" admits several possible POS taggings:
NN NN VB DT NN
NN VB P  DT NN
NN NN P  DT NN
NN VB VB DT NN
ML: Traditional Approach
• For each new problem/question
– Gather as much LABELED data as you can get
– Throw some algorithms at it (mainly put in an SVM and keep it at that)
– If you actually have tried more algos: pick the best
– Spend hours hand-engineering some features / feature selection / dimensionality reduction (PCA, SVD, etc.)
– Repeat…
Deep learning vs the rest
Deep Learning: Why for NLP?
• Beats the state of the art in:
– Language Modeling (Mikolov et al. 2011) [WSJ AR task]
– Speech Recognition (Dahl et al. 2012, Seide et al. 2011; following Mohamed et al. 2011)
– Sentiment Classification (Socher et al. 2011)
– MNIST handwritten digit recognition (Ciresan et al. 2010)
– Image Recognition (Krizhevsky et al. 2012) [ImageNet]
Language semantics
• What is the meaning of a word? (Lexical semantics)
• What is the meaning of a sentence? ([Compositional] semantics)
• What is the meaning of a longer piece of text? (Discourse semantics)
One-hot encoding
• Form a vocabulary of words that maps lemmatized words to a unique ID (the position of the word in the vocabulary)
• Typical vocabulary sizes vary between 10,000 and 250,000
One-hot encoding
• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID
– For vocabulary size D=10, the one-hot vector of word ID w=4 is e(w) = [ 0 0 0 1 0 0 0 0 0 0 ]
• A one-hot encoding makes no assumption about word similarity
• All words are equally different from each other
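A minimal sketch of building such an encoding in Python (the helper names and the toy sentence are illustrative, not from the slides; 0-based positions are used):

```python
import numpy as np

def build_vocab(tokens):
    """Map each distinct (lemmatized) word to a unique integer ID."""
    return {word: idx for idx, word in enumerate(sorted(set(tokens)))}

def one_hot(word_id, vocab_size):
    """All zeros except a 1 at the position associated with the word ID."""
    e = np.zeros(vocab_size)
    e[word_id] = 1.0
    return e

vocab = build_vocab("the cat sat on the mat".split())
print(one_hot(vocab["cat"], len(vocab)))   # [1. 0. 0. 0. 0.]
```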
Word representation
• Standard
– Bag of Words
– A one-hot encoding
– 20k to 50k dimensions
– Can be improved by factoring in document frequency
• Word embedding
– Neural word embeddings
– Uses a vector space that attempts to predict a word given a context window
– 200-400 dimensions
Word embeddings make semantic similarity and synonyms possible.
Distributional representations
• "You shall know a word by the company it keeps" (J. R. Firth, 1957)
• One of the most successful ideas of modern statistical NLP!
• Word embeddings (Bengio et al. 2001; Bengio et al. 2003), based on the idea of distributed representations for symbols (Hinton 1986)
• Neural word embeddings (Mnih and Hinton 2007; Collobert & Weston 2008; Turian et al. 2010; Collobert et al. 2011; Mikolov et al. 2011)
Neural distributional representations
• Neural word embeddings
• Combine vector space semantics with the prediction of probabilistic models
• Words are represented as a dense vector
• Example (figure): "Human" = a dense vector of real values
Vector space model
Word embeddings
Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning.
• What words have embeddings closest to a given word? (From Collobert et al. 2011)
Word Embeddings for MT: Mikolov (2013)
Word Embeddings
• One of the most exciting areas of research in deep learning
• Introduced by Bengio et al. (2003)
• W: words → R^n is a parameterized function mapping words in some language to high-dimensional vectors (200 to 500 dimensions)
– W("cat") = (0.2, -0.4, 0.7, ...)
– W("mat") = (0.0, 0.6, -0.1, ...)
• Typically, the function is a lookup table, parameterized by a matrix θ with a row for each word: Wθ(wn) = θn
• W is initialized with random vectors for each word
• The word embedding learns meaningful vectors by being trained to perform some task
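A minimal sketch of such a lookup-table embedding in NumPy (the vocabulary, `embedding_dim`, and function names are illustrative assumptions):

```python
import numpy as np

vocab = {"cat": 0, "mat": 1, "sat": 2, "on": 3, "the": 4}
embedding_dim = 200                 # typically 200-500 per the slide

# theta: one random row per word; training adjusts these rows
theta = np.random.randn(len(vocab), embedding_dim) * 0.01

def W(word):
    """Lookup-table embedding: W_theta(w_n) = theta_n."""
    return theta[vocab[word]]

print(W("cat").shape)               # (200,)
```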


Learning word vectors (Collobert et al., JMLR 2011)
• Idea: a word and its context is a positive training example; a random word in the same context gives a negative training example
Example
• Train a network to predict whether a 5-gram (sequence of five words) is "valid"
• Source: any text corpus (e.g., Wikipedia)
• Corrupt half of the 5-grams to create negative training examples
– Swap in a word to make the 5-gram nonsensical
– "cat sat song the mat"


Neural network to determine if a 5-gram is 'valid' (Bottou 2011)
• Look up each word in the 5-gram through W
• Feed those vectors into the network R
• R tries to predict whether the 5-gram is 'valid' or 'invalid'
– R(W("cat"), W("sat"), W("on"), W("the"), W("mat")) = 1
– R(W("cat"), W("sat"), W("song"), W("the"), W("mat")) = 0
• The network needs to learn good parameters for both W and R
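A minimal sketch of one way R could look (a single tanh hidden layer over the concatenated word vectors with a sigmoid output; the layer sizes are assumptions and the training loop is omitted):

```python
import numpy as np

embedding_dim, hidden = 200, 100
W1 = np.random.randn(hidden, 5 * embedding_dim) * 0.01   # input -> hidden weights
b1 = np.zeros(hidden)
W2 = np.random.randn(hidden) * 0.01                      # hidden -> output weights
b2 = 0.0

def R(word_vectors):
    """Score a 5-gram: concatenate its five word vectors, pass them through
    one tanh hidden layer, and squash to a validity score in (0, 1)."""
    x = np.concatenate(word_vectors)                 # shape: (5 * embedding_dim,)
    h = np.tanh(W1 @ x + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))      # ~1 for 'valid', ~0 for 'invalid'

score = R([np.random.randn(embedding_dim) for _ in range(5)])
```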
Idea
• "a few people sing well" → "a couple people sing well"
• The validity of the sentence doesn't change
• So if W maps synonyms (like "few" and "couple") close together,
– little changes from R's perspective
Bingo
• The number of possible 5-grams is massive
• But there is only a small number of data points to learn from
• The network generalizes across similar classes of words
– "the wall is blue" → "the wall is red"
• And across multiple words at once
– "the wall is blue" → "the ceiling is red"
• Shifting "red" closer to "blue" makes the network R perform better
Word embedding property
• Analogies between words are encoded in the difference vectors between words
– W("woman") − W("man") ≃ W("aunt") − W("uncle")
– W("woman") − W("man") ≃ W("queen") − W("king")
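A small sketch of checking such a regularity with vector arithmetic (it reuses the hypothetical lookup `W` from the earlier sketch; `most_similar` is an illustrative helper, not a library call):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(query_vec, W, vocab, exclude=()):
    """Return the vocabulary word whose embedding is closest to query_vec."""
    candidates = [w for w in vocab if w not in exclude]
    return max(candidates, key=lambda w: cosine(query_vec, W(w)))

# If the analogy holds, this retrieves something like "queen":
# query = W("king") + (W("woman") - W("man"))
# print(most_similar(query, W, vocab, exclude={"king", "woman", "man"}))
```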
Linguistic Regularities: Mikolov (2013)
Word embedding property: Shared representations
• "The use of word representations… has become a key 'secret sauce' for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling." (Luong et al. 2013)
• W and F learn to perform task A. Later, G can learn to perform task B based on W.
Bilingual word-embedding
English – Chinese word mapping
Embed images and words in a single representation
Feedforward neural net language model (NNLM), Bengio et al., 2003
• Long training time
Recurrent neural network based language model (Mikolov et al., 2010)
• Elman network
Simple RNN training
• Input vector: 1-of-N encoding (one-hot)
• Repeated epochs
– s(0): vector of small values (0.1)
– Hidden layer: 30-500 units
– All training data from the corpus are presented sequentially
– Initial learning rate: 0.1
– Error function: cross-entropy (standard for RNN language models)
– Standard backpropagation with stochastic gradient descent
• Convergence achieved after 10-20 epochs
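A minimal sketch of one step of such an Elman-style RNN language model (dimensions and initialization are illustrative assumptions; the softmax output gives a distribution over the next word):

```python
import numpy as np

vocab_size, hidden_size = 10_000, 100
U = np.random.randn(hidden_size, vocab_size) * 0.1   # input -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden (recurrence)
V = np.random.randn(vocab_size, hidden_size) * 0.1   # hidden -> output

def rnn_step(word_id, s_prev):
    """One Elman step: combine the current one-hot word with the previous
    state, then predict a probability distribution over the next word."""
    x = np.zeros(vocab_size); x[word_id] = 1.0
    s = 1.0 / (1.0 + np.exp(-(U @ x + W @ s_prev)))   # sigmoid hidden state
    logits = V @ s
    y = np.exp(logits - logits.max()); y /= y.sum()   # softmax over vocabulary
    return s, y

s = np.full(hidden_size, 0.1)          # s(0): vector of small values
s, next_word_probs = rnn_step(42, s)
```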
Word2vec (Mikolov et al., 2013)
• Log-linear model
• Previous models use a non-linear hidden layer, which drives up the complexity
• Continuous word vectors are learned with a simpler model
Continuous BoW (CBOW) Model
• Similar to the feed-forward NNLM, but
– the non-linear hidden layer is removed
• Called CBOW (continuous BoW) because the order of the words is lost; see the sketch below
CBOW Model (architecture figure)
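A minimal sketch of the CBOW idea in NumPy (average the context embeddings, then score every vocabulary word as the center word; the vocabulary, dimensions, and names are illustrative):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
dim = 100
W_in = np.random.randn(len(vocab), dim) * 0.01    # input (context) embeddings
W_out = np.random.randn(len(vocab), dim) * 0.01   # output (center-word) weights

def cbow_predict(context_words):
    """Average the context embeddings (word order is lost) and return a
    probability distribution over the center word."""
    h = np.mean([W_in[vocab[w]] for w in context_words], axis=0)
    logits = W_out @ h
    p = np.exp(logits - logits.max()); p /= p.sum()
    return p

p = cbow_predict(["the", "cat", "on", "the"])     # context around "sat"
print(p.argmax())                                 # index of the predicted center word
```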
Continuous Skip-gram Model
• Similar to CBOW, but
– tries to maximize classification of a word based on another word in the same sentence
• Predicts words within a certain window around the current word
• Observations
– A larger window size gives better quality of the resulting word vectors, but higher training time
– More distant words are usually less related to the current word than those close to it
– Give less weight to distant words by sampling them less often in the training examples (see the sketch below)
Continuous Skip-gram Model (architecture figure)
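A rough sketch of generating skip-gram training pairs with the distance-based down-weighting described above: the effective window is re-drawn uniformly for each center word, so distant context words are included less often. This mirrors the window-shrinking trick used in word2vec, but the code and names here are illustrative:

```python
import random

def skipgram_pairs(tokens, max_window=5):
    """Yield (center, context) pairs; distant words are sampled less often
    because the effective window size is drawn anew for each center word."""
    for i, center in enumerate(tokens):
        window = random.randint(1, max_window)     # shrink the window at random
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(sentence, max_window=3):
    print(center, "->", context)
```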
RECURSIVE NEURAL NETWORKS
Modular network that learns word embeddings
• Fixed number of inputs
Recursive neural networks
• The output of a module goes into a module of the same type
• Tree-structured neural networks
• No fixed number of inputs
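A minimal sketch of the recursive composition step (two child vectors are combined into a parent vector of the same size with shared weights, so the module can be reapplied up the tree; the weight names and the toy tree are illustrative):

```python
import numpy as np

dim = 50
W = np.random.randn(dim, 2 * dim) * 0.01   # shared composition weights
b = np.zeros(dim)

def compose(left, right):
    """Combine two child representations into a same-sized parent, so the
    module can be applied again higher up the tree."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# ((the cat) sat) -- a toy binary tree over word vectors
the, cat, sat = (np.random.randn(dim) for _ in range(3))
np_phrase = compose(the, cat)           # "the cat"
sentence_vec = compose(np_phrase, sat)  # "((the cat) sat)"
print(sentence_vec.shape)               # (50,)
```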
Building on Word Vector Space Models
• But how can we represent the meaning of longer phrases?
• By mapping them into the same vector space!
How should we map phrases into a vector space?
Sentence Parsing: What we want
Learn Structure and Representation
Recursive Neural Networks for Structure Prediction
Recursive Neural Network Definition
Recursive Application of Relational Operators
Parsing a sentence with an RNN
(Figure sequence: parsing a sentence, step by step)
Labeling in Recursive Neural Networks
Recursive matrix-vector model
Recursive neural tensor network 

Socher et al. 2013: Sentence sentiment analysis
Neural tensor network
Reversible sentence representation (Bottou 2011)
• Bilingual sentence representation
Cho et al. (2014)
Credits
• Richard Socher, Christopher Manning
– Stanford University
– nlp.stanford.edu/courses/NAACL2013/
• Roelof Pieters, PhD candidate, KTH/CSC
• http://colah.github.io/
• Bengio, GSS 2012
Language Modeling
• A language model is a probabilistic model that assigns a probability to any sequence of words: p(w1, ..., wT)
• Language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences
• Plays a crucial role in speech recognition and machine translation systems
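For reference, the sequence probability factorizes by the chain rule, and an n-gram model (next slide) approximates each conditional with a limited history; this is the standard formulation rather than a formula taken verbatim from the slides:

```latex
p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})
\approx \prod_{t=1}^{T} p(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```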
N-gram models
• An n-gram is a sequence of n words
– unigrams (n=1): "is", "a", "sequence", etc.
– bigrams (n=2): ["is", "a"], ["a", "sequence"], etc.
– trigrams (n=3): ["is", "a", "sequence"], ["a", "sequence", "of"], etc.
• n-gram models estimate the conditional probability of the next word from n-gram counts; see the sketch below
• The counts are obtained from a training corpus (a dataset of text)
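A small sketch of estimating a bigram model from counts (a maximum-likelihood estimate with no smoothing; the tiny corpus is illustrative):

```python
from collections import Counter

corpus = "is a sequence of words is a sequence of tokens".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(next_word, prev_word):
    """MLE estimate p(next | prev) = count(prev, next) / count(prev)."""
    return bigram_counts[(prev_word, next_word)] / unigram_counts[prev_word]

print(p_bigram("a", "is"))        # 1.0 -- "is" is always followed by "a"
print(p_bigram("words", "of"))    # 0.5 -- "of" is followed by "words" half the time
```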
