Detecting Paraphrases Using Recursive Autoencoders
Machine Learning Group RCC, University of Cambridge
Feynman Liang
5 Nov, 2015
Based on: Socher et al., "Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection" (NIPS 2011)
Example
The judge also refused to postpone the trial date of Sept. 29.
Obus also denied a defense motion to postpone the September trial
date.
Paraphrase detection problem
Given: Sentences v1:m, w1:n ∈ V∗
Task: Classify whether v1:m and w1:n are paraphrases of each other
Applications
Plagiarism Detection
Text Summarization
Information Retrieval
Figure: title page of Madnani & Tetreault, "Re-examining Machine Translation Metrics for Paraphrase Identification" (Educational Testing Service, Princeton, NJ)
Outline
Distributed Word Representations
Unfolding recursive autoencoders
Dynamic pooling
Results
Follow-up work
Distributed word representations
Distributional semantics
Goal: Construct a representation for language which captures semantic meaning and is convenient for computation
From linguistics:
Lexical/compositional semantics: meaning through individual words
and syntactic constructions (WordNet, formal language theory)
Distributional semantics: meaning through statistical properties (large
datasets, linear algebra)
Distributional hypothesis (Firth, Studies in Linguistic Analysis, 1957): “A word is characterized by the company it keeps”
One way to capture this is to model the joint density p(w_{1:T}) for w_{1:T} ∈ V*
Curse of dimensionality
The English language actually has > 10^6 words, but for simplicity's sake assume |V| = 10^5 and w_{1:10} ∈ V*.
How many free parameters could we potentially need to represent p(w_{1:10})?
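A quick worked count (my own arithmetic, not on the slide): an unrestricted joint table over 10 positions has one free probability per possible sequence, minus one for normalization:

$$|V|^{10} - 1 = (10^{5})^{10} - 1 = 10^{50} - 1 \approx 10^{50}$$

Far too many to estimate, hence the simplifying assumption that follows.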
Simplifying assumption
O(|V|^n) is intractable.
Dependencies between words tend to exist only within a local context =⇒ factorization into conditional probability distributions (CPDs):

$$P(w_{1:T}) \approx \prod_{t=1}^{T} P(w_t \mid \mathrm{context}_t \subset w_{1:t-1} w_{t+1:T})$$

For example:
n-gram: context_t = w_{t-n+1:t-1}
Continuous Bag of Words (word2vec; Mikolov, ICLR 2013): context_t = w_{t-4:t+4}
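A minimal Python illustration (my own, not from either paper) of the two context choices above:

```python
def ngram_context(tokens, i, n):
    """Left context w_{i-n+1:i-1} used by an n-gram model to predict w_i."""
    return tokens[max(0, i - n + 1):i]

def cbow_context(tokens, i, window=4):
    """Symmetric context w_{i-4:i+4} (excluding w_i itself) used by CBOW."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

tokens = "the judge also refused to postpone the trial date".split()
print(ngram_context(tokens, 5, 3))   # ['refused', 'to'] -- trigram context for 'postpone'
print(cbow_context(tokens, 5))       # up to 4 words on each side of 'postpone'
```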
Distributed representation for words
From Bengio, JMLR 2003:
From Bengio, JMLR 2003:
1. Associate with each word in the vocabulary a distributed word feature vector (w ∈ R^D),
2. Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence, and
3. Learn simultaneously the word feature vectors and the parameters of that probability function.

Embedding matrix L_e : V → R^D
Joint PDF:

$$P(w_{1:T}) = \prod_{t=1}^{T} P(w_t \mid L_e(w_{t-n+1:t-1}))$$

What are we taking the context to be? How many parameters does P(w_{1:T}) have?
Neural-network parameterization of CPD
Number of free parameters ∈ O(n + D|V|)

$$J(\theta) = \frac{1}{T} \sum_t \left[ -\log P(w_t \mid L_e(w_{t-n+1:t-1}), \theta) \right] + R(\theta)$$

(Bengio, JMLR 2003)
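A minimal numpy sketch of this style of parameterization (an illustrative toy with a single linear layer, not Bengio's exact architecture): embed the n-1 context words with L_e, concatenate, and output a softmax over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, n = 10_000, 50, 4                  # vocab size, embedding dim, n-gram order
Le = rng.normal(scale=0.1, size=(V, D))  # embedding matrix L_e : V -> R^D
W = rng.normal(scale=0.1, size=(V, (n - 1) * D))  # output weights
b = np.zeros(V)

def cpd(context_ids):
    """P(w_t | L_e(w_{t-n+1:t-1})) as a length-V probability vector."""
    x = Le[context_ids].reshape(-1)      # concatenated context embeddings
    logits = W @ x + b
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

p = cpd([17, 42, 7])                     # three context word ids
loss = -np.log(p[123])                   # one NLL term of J(theta) for target word id 123
```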
A “semantic” vector space
Empirically (Mikolov, NAACL 2013):
Words with similar meaning are mapped close together
Directions in the vector space correspond to semantic concepts
Figure: “gender” and “singular/plural” vector offsets from word analogy task
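A sketch of the analogy arithmetic with hand-made toy vectors (in practice the embeddings would come from pretrained word2vec or GloVe):

```python
import numpy as np

# Hypothetical toy embeddings; real ones would be loaded from a pretrained model.
emb = {
    "king":  np.array([0.8, 0.65, 0.1]),
    "queen": np.array([0.8, 0.66, 0.9]),
    "man":   np.array([0.1, 0.60, 0.1]),
    "woman": np.array([0.1, 0.62, 0.9]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "gender" offset: king - man + woman should land near queen
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # queen
```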
Unfolding Recursive Autoencoders
From words to sentences
L_e : V → R^D embeds words into a semantic vector space where the metric approximates semantic similarity.
If instead we had a map V* → R^D for sentences, then we could measure sentence similarity and detect paraphrases...
Autoencoders
Learn a compact representation capturing regularities present in the input

$$h = s_e(W_e x + b_e), \qquad \hat{x} = s_d(W_d h + b_d)$$

$$\min_{W_e, W_d} \|\hat{x} - x\|_{\ell_2}^2 + R(W_e, W_d, b_d)$$

Denoising: $h = s_e(W_e(x + \delta) + b_e)$
Stacking
Application in DNNs (Bengio, NIPS 2007): greedy layer-wise pretraining + discriminative fine-tuning
(A. Ng, CS294A lecture notes)
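A minimal numpy sketch of a single autoencoder layer, following the equations above (a toy; tanh is assumed for s_e and the identity for s_d):

```python
import numpy as np

rng = np.random.default_rng(1)
D_in, D_h = 8, 3
We, be = rng.normal(scale=0.1, size=(D_h, D_in)), np.zeros(D_h)
Wd, bd = rng.normal(scale=0.1, size=(D_in, D_h)), np.zeros(D_in)

def encode(x):
    return np.tanh(We @ x + be)        # h = s_e(W_e x + b_e)

def decode(h):
    return Wd @ h + bd                 # x_hat = s_d(W_d h + b_d), identity s_d

x = rng.normal(size=D_in)
x_hat = decode(encode(x))
recon_loss = np.sum((x_hat - x) ** 2)  # ||x_hat - x||^2 (plus regularizer R during training)
```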
Recursive autoencoders for sentence embedding
(R^D)* → R^D: recursively apply R^D × R^D → R^D

$$y_i = f(W_e [x_j; x_k] + b)$$

f : activation function
Free parameters:
W_e ∈ R^{D×2D}, encoding matrix
b ∈ R^D, bias
Anything missing from this definition?
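A toy sketch of the encoding recursion (my own illustration): given a bracketing of the sentence as nested pairs of word vectors, every internal node is composed as y = f(W_e[x_j; x_k] + b).

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4
We = rng.normal(scale=0.1, size=(D, 2 * D))  # encoding matrix W_e in R^{D x 2D}
b = np.zeros(D)                              # bias
f = np.tanh                                  # activation function

def encode(node):
    """node is either a word vector (leaf) or a (left, right) pair of sub-nodes."""
    if isinstance(node, np.ndarray):
        return node
    left, right = encode(node[0]), encode(node[1])
    return f(We @ np.concatenate([left, right]) + b)  # y = f(W_e [x_j; x_k] + b)

# Bracketing ((w1 w2) w3) -- in the paper this is supplied by a parse tree
w1, w2, w3 = (rng.normal(size=D) for _ in range(3))
sentence_vec = encode(((w1, w2), w3))                 # a single vector in R^D
```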
Associativity
The bracketing (association order) of the binary operation is provided by a grammatical parse tree (e.g. obtained from CoreNLP; Klein, ACL 2003):
Training recursive autoencoders
W_d “undoes” W_e (minimizes squared reconstruction error)
To train:

$$\operatorname*{argmin}_{W_e, W_d} \left\| [\hat{x}_1; \hat{y}_1] - [x_1; y_1] \right\|_2^2 + R(W_e, W_d)$$

Notice anything asymmetrical? (hint: is this even an autoencoder?)
Unfolding RAEs
Reconstruction error was only measured against a single decoding step!
Instead, recursively apply W_d to decode down to the terminals:

$$\operatorname*{argmin}_{W_e, W_d} \left\| [\hat{x}_i; \ldots; \hat{x}_j] - [x_i; \ldots; x_j] \right\|_2^2 + R(W_e, W_d)$$

Children with larger subtrees are weighted more
DAG =⇒ efficiently optimized via back-propagation through structure (Goller, 1995) and L-BFGS
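A self-contained toy sketch of the unfolding objective (my own simplification): encode up the tree, decode each node back down to its leaves with W_d, and accumulate the squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 4
We = rng.normal(scale=0.1, size=(D, 2 * D))
Wd = rng.normal(scale=0.1, size=(2 * D, D))
be, bd = np.zeros(D), np.zeros(2 * D)
f = np.tanh

def encode(node):
    if isinstance(node, np.ndarray):
        return node
    return f(We @ np.concatenate([encode(node[0]), encode(node[1])]) + be)

def unfold(y, node):
    """Recursively decode y down node's subtree; return reconstructed leaves in order."""
    if isinstance(node, np.ndarray):
        return [y]
    child = f(Wd @ y + bd)                 # reconstruct both children at once
    return unfold(child[:D], node[0]) + unfold(child[D:], node[1])

def leaves(node):
    return [node] if isinstance(node, np.ndarray) else leaves(node[0]) + leaves(node[1])

tree = ((rng.normal(size=D), rng.normal(size=D)), rng.normal(size=D))
recon_loss = sum(np.sum((x_hat - x) ** 2)  # || [x_hat_i; ...] - [x_i; ...] ||_2^2
                 for x_hat, x in zip(unfold(encode(tree), tree), leaves(tree)))
```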
Dynamic pooling
Measuring sentence similarity
From sentence x_{1:N} and RAE node encodings y_{1:K}, form s = [x_1, ..., x_N, y_1, ..., y_K]
For two sentences s_1, s_2, the similarity matrix S has entries

$$S_{i,j} = \left\| (s_1)_i - (s_2)_j \right\|_2^2$$
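In numpy, the similarity matrix is a single broadcasting step (random vectors stand in here for the actual word and RAE node vectors):

```python
import numpy as np

rng = np.random.default_rng(4)
D = 4
s1 = rng.normal(size=(7, D))  # word + RAE node vectors of sentence 1 (N + K rows)
s2 = rng.normal(size=(5, D))  # word + RAE node vectors of sentence 2

# S[i, j] = || (s1)_i - (s2)_j ||_2^2
S = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)  # shape (7, 5)
```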
Handling varying sentence length
Sentence lengths may vary =⇒ S dimensionality may vary.
Would like S → S_pooled ∈ R^{n_p × n_p} with n_p constant.
Pooling layers
Used in CNNs to achieve translation invariance
http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/
Dynamic pooling of the similarity matrix
Dynamically partition the rows and columns of S into n_p equal parts
Min-pool (why?) over each part
Normalize to µ = 0, σ = 1 and pass on to the classifier
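A simplified sketch of dynamic min-pooling (my own reading of the steps above; the paper's exact handling of unequal chunks and of matrices smaller than n_p × n_p may differ):

```python
import numpy as np

def dynamic_min_pool(S, n_p=15):
    """Min-pool a variable-size similarity matrix S down to a fixed n_p x n_p grid."""
    # If a dimension is shorter than n_p, tile it so every chunk is non-empty.
    reps = (-(-n_p // S.shape[0]), -(-n_p // S.shape[1]))  # ceiling division
    if max(reps) > 1:
        S = np.tile(S, reps)
    rows = np.array_split(np.arange(S.shape[0]), n_p)
    cols = np.array_split(np.arange(S.shape[1]), n_p)
    pooled = np.array([[S[np.ix_(r, c)].min() for c in cols] for r in rows])
    return (pooled - pooled.mean()) / pooled.std()         # normalize to mu = 0, sigma = 1

S = np.abs(np.random.default_rng(5).normal(size=(23, 17)))  # toy similarity matrix
S_pooled = dynamic_min_pool(S)                              # shape (15, 15)
```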
Results
Qualitative evaluation of unsupervised feature learning
Dataset
150,000 sentences from NYT and AP sections of Gigaword corpus for
RAE training
Setup
R^100 off-the-shelf feature vectors for word embeddings (Turian, ACL 2010)
Stanford parser (Klein, ACL 2003) to extract parse trees
Baseline
Recursive average of all word vectors in the parse tree
Nearest S_pooled neighbor
Figure: Nearest 2-norm neighbor
Recursive decoding
Figure: Unfolding RAE encode/decode
Paraphrase detection task
Dataset
Microsoft Research Paraphrase Corpus (MSRP; Dolan, COLING 2004)
5,801 sentence pairs, 3,900 labeled as paraphrases
Paraphrase detection task
Setup
4,076 training pairs (67.5% positive), 1,725 test pairs (66.5% positive)
∀(S_1, S_2) ∈ D, (S_2, S_1) is also added
Add features ∈ {0, 1} to S_pooled related to the sets of numbers in S_1 and S_2 (a toy sketch of these features follows below):
  Numbers in S_1 = numbers in S_2
  (Numbers in S_1 ∪ numbers in S_2) = ∅
  Numbers in one sentence ⊂ numbers in the other
Softmax classifier over S_pooled
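A toy sketch of such number features (my own reading of the three bullets above; the exact rules in the paper may differ):

```python
import re

def number_features(s1, s2):
    """Three {0,1} features on the sets of numbers appearing in the two sentences."""
    n1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    n2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    return [
        int(n1 == n2),                # same numbers in both sentences
        int(not (n1 | n2)),           # neither sentence contains a number
        int((n1 < n2) or (n2 < n1)),  # numbers of one sentence strictly contained in the other
    ]

print(number_features(
    "The judge also refused to postpone the trial date of Sept. 29.",
    "Obus also denied a defense motion to postpone the September trial date."))
# [0, 0, 1]: '29' appears only in the first sentence
```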
Example results
Numerical results
Recursive averaging: 75.9%
Standard RAE: 75.5%
Unfolding RAE: 76.8%
State of the art
“Paraphrase Identification (State of the Art).” ACLWiki. Web. 2 Nov 2015.
Does the dynamic pooling layer add anything?
S-histogram: 73.0%
Only added number features: 73.2%
Only S_pooled: 72.6%
Top URAE node: 74.2%
S_pooled + number features: 76.8%
Is anything suspicious about these results?
Follow-Up Work Since 2011
Extending RAEs to capture compositionality
Recursive Matrix-Vector Spaces (Socher, EMNLP 2012)

$$p = f\left(W_e \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right) \quad\rightarrow\quad p = f\left(W_e \begin{bmatrix} B a + b_0 \\ A b + a_0 \end{bmatrix} + p_0\right)$$
Encoding the parse tree using LSTMs
Tree-Structured LSTMs (Tai, ACL 2015)
Figure: a chain-structured LSTM versus a tree-structured LSTM composed along the parse tree (node labels x_i and y_i in the original figure)
Different “semantic norms” on the word vector space
Neural Tensor Networks (Socher, NIPS 2013)

$$g(e_1, R, e_2) = u_R^T \, f\!\left( e_1^T W_R^{[1:k]} e_2 + V_R \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} + b_R \right)$$

Figure: knowledge-graph relations (profession, gender, place of birth, nationality, location) among entities such as Francesco Guicciardini, Francesco Patrizi, and Matteo Rosselli
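The scoring function translates directly into numpy (toy dimensions and random parameters; a sketch of the formula above, not the released model):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 5, 3                                 # entity dimension, number of tensor slices
W = rng.normal(scale=0.1, size=(k, d, d))   # W_R^{[1:k]}
V = rng.normal(scale=0.1, size=(k, 2 * d))  # V_R
b = np.zeros(k)                             # b_R
u = rng.normal(scale=0.1, size=k)           # u_R

def ntn_score(e1, e2):
    """g(e1, R, e2) = u_R^T f( e1^T W_R^{[1:k]} e2 + V_R [e1; e2] + b_R )"""
    bilinear = np.einsum("i,kij,j->k", e1, W, e2)  # one bilinear form per tensor slice
    return u @ np.tanh(bilinear + V @ np.concatenate([e1, e2]) + b)

score = ntn_score(rng.normal(size=d), rng.normal(size=d))  # higher = relation more plausible
```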
Questions?
