Recursive neural networks (RNNs) were developed to model recursive structures like images, sentences, and phrases. RNNs construct feature representations recursively from components. Later models like recursive autoencoders (RAEs), matrix-vector RNNs (MV-RNNs), and recursive neural tensor networks (RNTNs) improved on RNNs by handling unlabeled data, incorporating different composition rules, and reducing parameters. These recursive models achieved strong performance on tasks like image segmentation, sentiment analysis, and paraphrase detection.
2. Recursive Neural Network (RNN) - Motivation
• Motivation: Many real-world objects have a recursive structure,
e.g. images are composed of segments, and sentences are composed of words
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
Image from Stanford CS224N Lecture Note 14.
3. Recursive Neural Network (RNN) - Motivation
• Motivation: Can we learn a good representation for the recursive structures?
• Recursive structures (phrases) and their components (words) should lie in the same space,
e.g. "the country of my birth" ≃ Germany, France, etc.
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
Image from Stanford CS224N Lecture Note 14.
4. Recursive Neural Network (RNN) - Model
• Goal: Design a neural network whose features are recursively constructed
• Each module maps two children to one parent, all lying in the same vector space
• To determine the order of recursion, we assign a score (plausibility) to each node
• Hence, the neural network module outputs (representation, score) pairs (see the sketch below)
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
Image from Stanford CS224N Lecture Note 14.
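To make the module concrete, below is a minimal NumPy sketch of one composition step. The parameter names (W, b, w_score), the tanh nonlinearity, and the dimensions are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

n = 50                                         # dimensionality of the shared space
rng = np.random.default_rng(0)

# Parameters of the composition module, shared across all merges.
W = rng.normal(scale=0.1, size=(n, 2 * n))     # maps [child1; child2] -> parent
b = np.zeros(n)
w_score = rng.normal(scale=0.1, size=n)        # maps parent -> plausibility score

def compose(c1, c2):
    """Merge two child vectors into a (parent representation, plausibility score) pair."""
    parent = np.tanh(W @ np.concatenate([c1, c2]) + b)
    score = float(w_score @ parent)
    return parent, score

# Two word vectors; the resulting phrase vector lives in the same n-dimensional space.
c1, c2 = rng.normal(size=n), rng.normal(size=n)
p, s = compose(c1, c2)
print(p.shape, s)                              # (50,) and a scalar score
```

All merges share the same parameters, which is what lets the same module be applied recursively from words (or segments) up to the root.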
5. Recursive Neural Network (RNN) - Model
• cf. Note that a recurrent neural network is a special case of a recursive neural network (its tree is a left-to-right chain)
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
Image from Ratsgo’s blog for text mining.
6. Recursive Neural Network (RNN) - Inference
• At each step, merge the two adjacent nodes whose merge has the highest score
• With a greedy algorithm, inference requires only 𝑂(𝑁) merge steps (see the sketch below)
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
Image from Stanford CS224N Lecture Note 14.
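A minimal sketch of the greedy parsing loop: at every step, score all adjacent pairs with the composition module (e.g. the hypothetical `compose` above) and merge the highest-scoring pair, so N leaves are reduced to a root in N−1 merges. This is an illustrative reading of the procedure, not the paper's exact implementation.

```python
def greedy_parse(leaves, compose):
    """Greedily build a binary tree by repeatedly merging the best-scoring adjacent pair.

    leaves  : list of n-dimensional vectors (word or segment representations)
    compose : function (c1, c2) -> (parent_vector, score)
    Returns (root_vector, nested tuple describing the tree).
    """
    nodes = [(vec, i) for i, vec in enumerate(leaves)]          # (vector, subtree)
    while len(nodes) > 1:
        # Score every adjacent pair and pick the most plausible merge.
        merges = [compose(nodes[i][0], nodes[i + 1][0]) for i in range(len(nodes) - 1)]
        best = max(range(len(merges)), key=lambda i: merges[i][1])
        parent_vec, _ = merges[best]
        nodes[best:best + 2] = [(parent_vec, (nodes[best][1], nodes[best + 1][1]))]
    return nodes[0]
```

For example, `greedy_parse([rng.normal(size=n) for _ in range(5)], compose)` returns the root vector together with a nested tuple such as `(((0, 1), 2), (3, 4))` describing the induced tree.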
10. Recursive Neural Network (RNN) - Inference
• We can apply beam search to improve performance
• Beam search: keep the top-𝑘 candidates at each step (greedy search = 1-beam search); see the sketch below
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
Image from Ratsgo’s blog for text mining.
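A sketch of the k-beam variant: instead of committing to the single best merge, keep the k highest-scoring partial parses at every step (k = 1 recovers the greedy parser above). Again an illustrative version, not the paper's exact search.

```python
def beam_parse(leaves, compose, k=3):
    """Beam search over merge orders; each beam state is
    (cumulative score, list of (vector, subtree) nodes)."""
    beam = [(0.0, [(vec, i) for i, vec in enumerate(leaves)])]
    while len(beam[0][1]) > 1:
        expanded = []
        for total, nodes in beam:
            for i in range(len(nodes) - 1):
                parent_vec, score = compose(nodes[i][0], nodes[i + 1][0])
                merged = (nodes[:i]
                          + [(parent_vec, (nodes[i][1], nodes[i + 1][1]))]
                          + nodes[i + 2:])
                expanded.append((total + score, merged))
        # Keep only the k highest-scoring partial parses for the next step.
        expanded.sort(key=lambda state: state[0], reverse=True)
        beam = expanded[:k]
    return beam[0]                              # (best total score, [(root vector, tree)])
```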
11. Recursive Neural Network (RNN) - Training
• Let (sentence, tree) pairs (x_i, y_i) be given
• Let s(x_i, y) be the score of tree y, i.e. the sum of the scores of all non-leaf nodes
• Let A(x_i) be the set of candidate trees (approximated by beam search)
• Then the max-margin objective (to be maximized) is
  $$J = \sum_i \Big[ s(x_i, y_i) - \max_{y \in A(x_i)} \big( s(x_i, y) + \Delta(y, y_i) \big) \Big]$$
where Δ(y, y_i) is the number of wrong subtrees (a loss sketch follows below)
• We can also add a classification loss at each node
(using the node's feature as input to the classifier)
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
Image from Stanford CS224N Lecture Note 14.
This objective increases s(x_i, y_i) and decreases s(x_i, y) whenever s(x_i, y) + Δ(y, y_i) > s(x_i, y_i).
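A small sketch of the max-margin training signal above, written as a per-example loss to minimize (the negative of the objective). `score_tree` and `num_wrong_subtrees` are assumed helpers: the first sums the module's plausibility scores over a tree's non-leaf nodes, the second computes Δ(y, y_i).

```python
def max_margin_loss(gold_tree, candidate_trees, score_tree, num_wrong_subtrees):
    """Structured hinge loss for one (sentence, tree) example:
    max over candidates of [ s(x, y) + Delta(y, y_gold) ] minus s(x, y_gold)."""
    s_gold = score_tree(gold_tree)
    worst = max(score_tree(y) + num_wrong_subtrees(y, gold_tree)
                for y in candidate_trees)
    return max(0.0, worst - s_gold)   # non-negative; zero when the gold tree wins by the margin
```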
12. Recursive Neural Network (RNN) - Experiments
• After training, both leaf nodes and higher nodes learn valid representations
• Image segmentation: infer classes for segments (the feature extractor is jointly trained)
• Phrase clustering: nearest-neighbor search on phrase features
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
13. Recursive Neural Network (RNN) - Appendix
• Preprocessing: how do we map segments/words into the representation space ℝ^n?
• Words: use a pretrained word-embedding model (e.g. word2vec), V → ℝ^n
• Images: extract hand-crafted features in ℝ^m, and jointly train a network F: ℝ^m → ℝ^n
• Extension to image segmentation
• A segment can be adjacent to several other segments
• Hence, there are multiple correct tree structures
• Hence, Δ(y, y_i) checks whether each subtree is included in the set of correct tree structures
Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
14. Recursive Autoencoder (RAE) - Motivation & Idea
• Motivation: the recursive neural network (RNN) requires ground-truth tree structures for training
• The recursive autoencoder (RAE) extends the RNN to the unsupervised (and semi-supervised) setting
• If the tree structure y is given, we can train a local autoencoder [c_1; c_2] → p → [c_1'; c_2'] at each node, with reconstruction loss
  $$L(y) = \sum_{(c_1, c_2, p) \in y} \big\| [c_1; c_2] - [c_1'; c_2'] \big\|^2$$
Socher et al. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. EMNLP 2011.
15. Recursive Autoencoder (RAE) - Model
• If the tree structure y is given, we train the local autoencoder [c_1; c_2] → p → [c_1'; c_2'] at each node, with reconstruction loss
  $$L(y) = \sum_{(c_1, c_2, p) \in y} \big\| [c_1; c_2] - [c_1'; c_2'] \big\|^2$$
• If the tree structure is not given, we take the minimum over all candidate trees A(x_i):
  $$\hat{y} = \operatorname*{argmin}_{y \in A(x_i)} L(y) = \operatorname*{argmin}_{y \in A(x_i)} \sum_{(c_1, c_2, p) \in y} \big\| [c_1; c_2] - [c_1'; c_2'] \big\|^2$$
• Here, A(x_i) is approximated by greedy search, using the reconstruction loss as the (negated) score
• Length normalization: minimizing the reconstruction loss pushes the norm of the hidden nodes toward 0;
  to prevent this, normalize each hidden node by its length: p/‖p‖ (see the sketch below)
• The resulting tree captures the information of the words but does not follow the syntax
• However, the learnt representation is still useful
Socher et al. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. EMNLP 2011.
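A minimal sketch of the RAE's per-node autoencoder with the length normalization described above. The encoder/decoder matrices W_e, W_d and the tanh nonlinearity are illustrative assumptions; plugging `rae_compose` into the earlier greedy parser (with the negative reconstruction loss as the score) gives the unsupervised tree-building procedure.

```python
import numpy as np

n = 50
rng = np.random.default_rng(0)
W_e = rng.normal(scale=0.1, size=(n, 2 * n))    # encoder: [c1; c2] -> p
W_d = rng.normal(scale=0.1, size=(2 * n, n))    # decoder: p -> [c1'; c2']

def rae_node(c1, c2):
    """Encode two children, length-normalize the parent, return (parent, recon loss)."""
    children = np.concatenate([c1, c2])
    p = np.tanh(W_e @ children)
    p = p / np.linalg.norm(p)                   # length normalization: p / ||p||
    recon = W_d @ p                             # [c1'; c2']
    loss = float(np.sum((children - recon) ** 2))
    return p, loss

def rae_compose(c1, c2):
    """Adapter so the greedy parser prefers the merge with the lowest reconstruction loss."""
    p, loss = rae_node(c1, c2)
    return p, -loss
```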
16. Recursive Autoencoder (RAE) - Experiments
• For each paragraph, votes over 5 sentiment categories are labeled (one paragraph can receive multiple votes)
• Train a logistic regression model on the learnt representation
• The learnt representation was better than baseline representations,
e.g. binary bag-of-words, hand-crafted features, and the average of word vectors
Socher et al. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. EMNLP 2011.
17. Unfolding RAE & Dynamic Pooling - Model
• The unfolding RAE is a global-autoencoder version of the RAE: each node reconstructs all the leaves beneath it (more expensive, but potentially better)
• In some tasks, e.g. paraphrase detection, we need to compare the features of two sentences
• Comparing all node features works better than comparing only the root features, but the number of nodes differs between sentences
• Dynamic pooling converts the variable-sized pairwise similarity matrix into a fixed-size matrix (see the sketch below)
Socher et al. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. NIPS 2011.
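A sketch of dynamic pooling under one common reading: form the pairwise distance matrix between all node features (words and phrases) of the two sentences, then pool it down to a fixed n_p × n_p grid with min-pooling. The chunking and the handling of short sentences are illustrative simplifications.

```python
import numpy as np

def dynamic_pool(features_a, features_b, n_p=15):
    """Pool a variable-size pairwise distance matrix into a fixed n_p x n_p matrix.

    features_a, features_b : arrays of shape (num_nodes, n) with the features of every
                             node (leaves and internal nodes) of the two sentences.
    Assumes both sentences have at least n_p nodes; shorter inputs would need their
    rows/columns duplicated first.
    """
    # Pairwise Euclidean distances between all nodes of the two trees.
    diff = features_a[:, None, :] - features_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                 # shape (len_a, len_b)

    # Split rows and columns into n_p roughly equal chunks and min-pool each cell.
    row_chunks = np.array_split(np.arange(dist.shape[0]), n_p)
    col_chunks = np.array_split(np.arange(dist.shape[1]), n_p)
    pooled = np.empty((n_p, n_p))
    for i, rows in enumerate(row_chunks):
        for j, cols in enumerate(col_chunks):
            pooled[i, j] = dist[np.ix_(rows, cols)].min()
    return pooled
```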
18. Unfolding RAE & Dynamic Pooling - Experiments
• The unfolding RAE learns better representations than the standard RAE
• Unfolding RAE + dynamic pooling gives the best representation for similarity classification
Socher et al. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. NIPS 2011.
(Tables from the paper: nearest neighbors and similarity-classification results.)
19. Matrix-Vector RNN (MV-RNN)
• Motivation: different word pairs have different composition rules
• Idea: represent the composition rule of a word in ℝ^n by a matrix in ℝ^{n×n}
• Hence, each word is represented by a matrix-vector pair (a, A) ∈ ℝ^n × ℝ^{n×n}
• For two words (a, A) and (b, B), the parent node (p, P) is given by
  $$p = f_V(a, b, A, B) = \tilde{f}_V(Ba, Ab) \qquad \text{and} \qquad P = f_M(A, B) = W_M \, [A; B]$$
  (see the composition sketch below)
• This requires storing an n×n matrix for every word (n × n × |V| parameters), so the authors use a low-rank approximation to reduce the number of parameters
• The MV-RNN shows better performance than the vanilla RNN
Socher et al. Semantic Compositionality through Recursive Matrix-Vector Spaces. EMNLP 2012.
(Table from the paper: semantic classification results.)
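A sketch of one MV-RNN composition step following the formulas above; the shared matrices W, W_M and the tanh nonlinearity stand in for f_V and f_M and are illustrative assumptions.

```python
import numpy as np

n = 10
rng = np.random.default_rng(0)
W   = rng.normal(scale=0.1, size=(n, 2 * n))    # vector composition: [Ba; Ab] -> p
W_M = rng.normal(scale=0.1, size=(n, 2 * n))    # matrix composition: [A; B] -> P

def mv_compose(a, A, b, B):
    """Compose two (vector, matrix) words into a parent (p, P) in the same spaces."""
    p = np.tanh(W @ np.concatenate([B @ a, A @ b]))     # p = f_V(a, b, A, B)
    P = W_M @ np.vstack([A, B])                         # P = W_M [A; B], shape (n, n)
    return p, P

# Each word is a vector in R^n plus its own composition matrix in R^{n x n}.
a, b = rng.normal(size=n), rng.normal(size=n)
A = np.eye(n) + 0.01 * rng.normal(size=(n, n))
B = np.eye(n) + 0.01 * rng.normal(size=(n, n))
p, P = mv_compose(a, A, b, B)
```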
20. Recursive Neural Tensor Network (RNTN)
• Motivation: modeling composition is appealing, but the MV-RNN uses too many parameters
• Instead of using one matrix per word, use a single shared tensor to represent composition
• Formally, let $V^{[1:n]} \in \mathbb{R}^{2n \times 2n \times n}$, where each slice $V^{[i]} \in \mathbb{R}^{2n \times 2n}$
• Then the composition term h ∈ ℝ^n for children (a, b) is given by
  $$h_i = [a; b]^{\top} \, V^{[i]} \, [a; b]$$
  and the parent p ∈ ℝ^n is
  $$p = f(a, b, h) = \tilde{f}\big(h + W \, [a; b]\big)$$
  (see the sketch below)
• This reduces the number of composition parameters from n × n × |V| (one matrix per word) to 2n × 2n × n (one shared tensor)
• The RNTN also shows better performance than the MV-RNN
Socher et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013.
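A sketch of the RNTN composition step following the formulas above; the tensor V, the matrix W, and the tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

n = 10
rng = np.random.default_rng(0)
V = rng.normal(scale=0.01, size=(n, 2 * n, 2 * n))   # one 2n x 2n slice per output dimension
W = rng.normal(scale=0.1, size=(n, 2 * n))           # standard RNN composition matrix

def rntn_compose(a, b):
    """Parent p = f(h + W [a; b]), where h_i = [a; b]^T V^[i] [a; b]."""
    ab = np.concatenate([a, b])                      # [a; b] in R^{2n}
    h = np.einsum('i,kij,j->k', ab, V, ab)           # bilinear tensor term, shape (n,)
    return np.tanh(h + W @ ab)

p = rntn_compose(rng.normal(size=n), rng.normal(size=n))
print(p.shape)                                       # (10,)
```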
21. Reference
• Recursive Neural Network (RNN): Socher et al. Parsing Natural Scenes and Natural
Language with Recursive Neural Networks. ICML 2011.
• Recursive Autoencoder (RAE): Socher et al. Semi-Supervised Recursive Autoencoders for
Predicting Sentiment Distributions. EMNLP 2011.
• Unfolding RAE & Dynamic Pooling: Socher et al. Dynamic Pooling and Unfolding Recursive
Autoencoders for Paraphrase Detection. NIPS 2011.
• Matrix-Vector RNN (MV-RNN): Socher et al. Semantic Compositionality through Recursive
Matrix-Vector Spaces. EMNLP 2012.
• Recursive Neural Tensor Network (RNTN): Socher et al. Recursive Deep Models for
Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013.