A brief survey of deep learning/neural network methods currently used in NLP: recurrent networks (LSTM, GRU), recursive networks, convolutional networks, hybrid architectures, attention models. We will look at specific papers in the literature, targeting sentiment analysis, text classification and other tasks.
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
1. A Survey of Current Neural Network Architectures for NLP
Márton Miháltz
Meltwater Group
Hungarian NLP Meetup
2. 2
• Introduction
• Short intro to NN concepts
• Recurrent neural networks
• Long Short-Term Memory, Gated Recurrent Unit
• Recursive neural networks
• Applications to sentiment analysis: Socher et al. 2013; Tai et al. 2015
• Convolutional neural networks
• Applications to text classification: Kim 2014
• Some more recent architectures
• Memory networks, attention models, hybrid architectures
• Tools
• Theano, Torch, TensorFlow, Caffe, Keras
Outline
3. 3
• Feed-forward neural network
• Activation fn.: tanh, ReLU, Leaky/Parametric ReLU, SoftPlus, …
• Logistic regression or softmax function for the classification layer
• Loss functions (objectives): categorical cross-entropy, negative log-likelihood, …
• Training (optimizers): Gradient Descent, SGD, Mini-batch GD, RMSprop, Adadelta, Adagrad, Adam, Adamax, Nesterov Momentum, L-BFGS, …
Very Short Intro to Modern Neural Networks
• Input embeddings
• One-hot encoding
• Random vectors
• Pre-trained vectors, e.g. from distributional similarity (see the sketch below)
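To make the pieces above concrete, here is a minimal sketch of a feed-forward text classifier in Keras (not from the talk; the sizes and the Sequential-style setup are my own assumptions):

    from keras.models import Sequential
    from keras.layers import Embedding, Flatten, Dense

    # Hypothetical sizes: 10k-word vocabulary, 100-dim embeddings, inputs padded to 50 tokens
    model = Sequential()
    model.add(Embedding(10000, 100, input_length=50))  # input embeddings (random or pre-trained init)
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))             # hidden layer with ReLU activation
    model.add(Dense(2, activation='softmax'))            # softmax classification layer
    model.compile(optimizer='sgd',                       # (mini-batch) gradient descent
                  loss='categorical_crossentropy')       # objective: categorical cross-entropy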
4. 4
● Tutorials, Blogs
○ Denny Britz’s blog (RNNs, CNNs for NLP, code etc.) -- code in Theano, TensorFlow
○ Christopher Olah’s blog (architectures, DL for NLP etc.)
○ Andrej Karpathy’s fun blogpost about RNNs: generate Shakespeare, Paul Graham text, LaTeX source, C code etc. + nice LSTM activity visualizations
○ Deeplearning.net Tutorial -- code in Theano (python)
● Courses
○ Richard Socher’s course Deep Learning for Natural Language Processing at Stanford -- code in TensorFlow
○ Stanford Unsupervised Feature Learning and Deep Learning Tutorial -- code in Matlab
○ Stanford course Convolutional Neural Networks for Visual Recognition (Andrej Karpathy)
● Other sources
○ The Deep Learning book (Goodfellow, Bengio &amp; Courville)
Further Reading (DL for NLP)
5. 5
• Powerful apparatus for learning complex functions for ML
• Better at certain NLP tasks than previous methods
• Pre-trained distributed representation vectors
• Word2vec, GloVe, gensim, doc2vec, skip-thought vectors etc.
• Vector space properties: similarity, analogies, compositionality etc. (see the example below)
• Less feature engineering needed
• Network learns abstract representations
• Transfer learning / domain adaptation
• Joint learning/execution of NLP steps possible
• Easy to go multimodal
Why Deep Learning for NLP?
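As a quick illustration of the vector-space properties mentioned above, a small gensim sketch with pre-trained word2vec vectors (the file name is only an example; any word2vec-format file works):

    from gensim.models import KeyedVectors

    # Assumed path to pre-trained word2vec vectors in binary format
    vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    print(vectors.similarity('movie', 'film'))  # cosine similarity between two words
    # Analogy via vector arithmetic: king - man + woman ~ queen
    print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))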
6. 6
● About RNNs
○ Internal state depends on the previous step’s state (recurrence shown below)
○ Good for sequential input
○ Backpropagation Through Time (BPTT) training
● Applications
○ Language modeling (e.g. in machine translation)
○ Sequential labeling
○ Text generation (e.g. image description generation, together w/ CNN)
● Problems with RNNs
○ Long sentences, long-term dependencies
○ Exponentially shrinking gradients (“vanishing gradients”)
○ Solutions:
■ Initialization of weights; regularization; using ReLU activ. fn.
■ RNN variations: bidirectional RNN, deep RNN etc.
■ gated RNNs: LSTM, GRU
Recurrent Neural Networks
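For reference, the recurrence behind a vanilla RNN, in one common formulation (notation mine, not from the slides):

    h_t = \tanh(W x_t + U h_{t-1} + b)
    y_t = \mathrm{softmax}(V h_t + c)

The hidden state h_t summarizes the sequence seen so far; unrolling this over many time steps is what makes BPTT and the vanishing-gradient problem appear.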
7. 7
• Long Short-Term Memory networks
• A special recurrent network
• Has a memory cell (internal memory) (c)
• 3 gates: input, forget, output: sigmoid layers with a pointwise multiplication operation (vectors of values in [0, 1])
• The LSTM is able to remove or add information to the cell state, regulated by the gates, which optionally let information through (see the gate equations below)
• Gated Recurrent Units
• Another RNN variant
• No internal memory separate from the hidden state
• 2 gates: reset (r), update (z)
• Reset gate: how to combine the new input with the previous state; update gate: how much of the previous state to keep
LSTMs and GRUs
[Figure: LSTM and GRU gating diagrams, from Chung et al. 2014, red labels added by me]
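For reference, the standard LSTM and GRU updates (roughly the formulation of Chung et al. 2014; symbols may differ slightly from the figure).

LSTM (input, forget, output gates; memory cell c):

    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
    \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \tanh(c_t)

GRU (reset gate r, update gate z):

    z_t = \sigma(W_z x_t + U_z h_{t-1})
    r_t = \sigma(W_r x_t + U_r h_{t-1})
    \tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))
    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t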
8. 8
• Overcome RNNs’ long-term dependency limitations &amp; the vanishing gradient problem
• Very popular in current NLP applications, e.g. state of the art in machine translation
• More complex architectures:
• Bi-directional LSTM
• Stacked (deep) (B-)LSTM/GRU layers
• Another extension, Grid-LSTM (Kalchbrenner et al. 2015)
• Still evolving!
• LSTM vs. GRU: the jury is still out on which is better
• GRU has fewer parameters, may be faster to train
• LSTM may be better with more data
LSTMs and GRUs
9. 9
• About recursive neural networks
• Hierarchical architecture
• Shared weights
• Plausible approach for modeling linguistic structures
• Sentiment Analysis with Recursive Networks (Socher et al. 2013)
• Compositional processing of parsed input (e.g. able to handle negation)
• Performs sentence-level sentiment classification: Rotten Tomatoes dataset (Pang &amp; Lee 2005): 11K movie review sentences, pos or neg; 85.4% accuracy on the binary-class subset, 45.7% on 5-class
• No longer the state-of-the-art score, but it was the first to go above 80% in 7 years
• Sentiment Treebank for training
Recursive Networks
10. 10
• Sentence words: embedding layer w/ random initial vectors (d=25..35)
• Parse nodes: a compositionality function computes representations recursively
• Softmax classifier: pos-neg (or 5-class) label for each word & each parse node
Recursive Neural Tensor Network
● Weight tensor V (composition function sketched below)
● Intuition: each slice of the tensor captures a specific type of composition
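A sketch of the RNTN composition function from Socher et al. 2013 (my reconstruction; see the paper for exact notation). For child vectors a, b at a parse node, the parent vector is

    p = \tanh\left( \begin{bmatrix} a \\ b \end{bmatrix}^{\top} V^{[1:d]} \begin{bmatrix} a \\ b \end{bmatrix} + W \begin{bmatrix} a \\ b \end{bmatrix} \right)

where each slice V^{[i]} is a 2d x 2d matrix producing one dimension of p, which is what lets each slice capture a different type of composition.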
12. 12
• Tree-LSTM
• Using constituency parsing
• Using GloVe word vectors, updated during training
• Idea: sum the hidden states of each tree node’s children (see the update equations below)
• Each child has its own forget gate
• Polarity softmax classifiers on tree nodes
• Improves on Socher et al. 2013:
• Fine-grained sentence sentiment: 51.0% vs. 45.7%
• Binary sentence sentiment: 88.0% vs. 85.4%
Tree-LSTMs for Sentiment Analysis
(Tai et al 2015)
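The Child-Sum Tree-LSTM updates from Tai et al. 2015, sketched from memory (the paper also defines an N-ary variant for binarized constituency trees); C(j) is the set of children of node j:

    \tilde{h}_j = \sum_{k \in C(j)} h_k
    i_j = \sigma(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)})
    f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)})    % one forget gate per child k
    o_j = \sigma(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)})
    u_j = \tanh(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)})
    c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k
    h_j = o_j \odot \tanh(c_j)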
13. 13
Convolutional Neural Networks
• CNNs (ConvNets) widely used in image processing
• Location invariance
• Compositionality
• Fast
• Convolution layers
• “sliding window” over the input representation: filter/kernel/feature generator
• Local connectivity
• Sharing weights
• Hyperparameters
• Wide vs. narrow convolution (padding)
• Filter size (width, height, depth)
• Number of filters/layer
• Stride size
• Channels (R, G, B)
14. 14
CNNs for Text Classification
● Intuition: filter windows over sentence words &lt;-&gt; n-grams
● Advantage over Recursive NN/Tree-LSTM: does not require parsing
● Becoming a standard baseline for new text classification architectures
● Easy to parallelize on GPUs
15. 15
CNN for Sentiment Analysis (Kim 2014)
• Sentence polarity classification (RT dataset/Sentiment Treebank)
• 88.1% on binary sentiment classification
• Uses word2vec vectors
• Sentences: concatenated word vectors
• 2 channels: one with static word2vec vectors, one fine-tuned via backprop
• Multiple window sizes (h=3,4,5) and multiple filters (e.g. 100 per window size)
• Apply max-pooling on feature map
• Selects the most important feature from each feature map
• Penultimate layer: final feature vector
• Concatenate all pooled features
• Final layer: softmax classifier (pos/neg sentiment) (see the sketch below)
• Regularization: dropout on the penultimate layer
• Randomly sets to 0 a proportion of the feature values during training
• Prevents co-adaptation of hidden units (reduces overfitting)
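A minimal sketch of this kind of model in Keras (functional API; sizes are placeholders, and a faithful reproduction would initialize the Embedding with word2vec weights and use the static + fine-tuned channels described above):

    from keras.models import Model
    from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout, concatenate

    vocab_size, emb_dim, seq_len = 20000, 300, 60    # placeholder sizes

    inp = Input(shape=(seq_len,), dtype='int32')
    emb = Embedding(vocab_size, emb_dim)(inp)        # in practice: word2vec-initialized weights

    pooled = []
    for h in (3, 4, 5):                              # window sizes h = 3, 4, 5
        conv = Conv1D(100, h, activation='relu')(emb)     # 100 filters per window size
        pooled.append(GlobalMaxPooling1D()(conv))         # max-over-time pooling per feature map

    features = concatenate(pooled)                   # penultimate layer: concatenated pooled features
    features = Dropout(0.5)(features)                # dropout regularization
    out = Dense(2, activation='softmax')(features)   # pos/neg softmax classifier

    model = Model(inputs=inp, outputs=out)
    model.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['accuracy'])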
17. 17
• Recursive NNs
• Linguistically plausible, applicable to grammatical structures, needs parsing
• Recurrent NNs
• Engineered for sequential input, current improvements with gated RNNs (LSTM, GRU etc.)
• Convolutional NNs
• Exceptionally good for classification; unclear how to incorporate phrase-level structures, hard to interpret, needs zero padding, good for GPUs
Summary
18. 18
• Memory Networks
• MemN2N (Sukhbaatar et al 2015): Facebook’s bAbI Question Answering tasks 90-90%
• Dynamic Memory Networks (Kumar, Irsoy et al 2015): Sentiment on RT dataset 88.6%; episodic memory: input sequences, questions, reasoning about answers
• Attention models (the basic mechanism is sketched below)
• Parsing (Vinyals & Hinton et al 2015); Machine Translation (Bahdanau & Bengio et al 2016)
• Relation extraction with LSTM + attention (Zhou et al 2016)
• Sentence embeddings with attention model (Wang et al 2016)
• Hybrid architectures
• NER with BLSTM-CNN (Chiu & Nichols 2016): 91.62% CoNLL, 86.28% OntoNotes
• Sequential labeling with BLSTM-CNN-CRF (Ma & Hovy 2016): 97.55% PoS, 91.21% NER
• Sentiment Analysis using CNN-LSTM (Wang et al 2016)
• Joint learning of NLP tasks
• POS tagging, chunking and CCG supertagging with one network (Søgaard &amp; Goldberg 2016)
• JEDI: Joint learning of NER and RE (Kirschnick et al 2016)
Some Recent Work
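The basic (additive) attention mechanism behind the MT and parsing work above, roughly as in Bahdanau et al.: for decoder state s_{i-1} and encoder states h_j,

    e_{ij} = a(s_{i-1}, h_j), \qquad
    \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad
    c_i = \sum_j \alpha_{ij} h_j

where a is a small feed-forward scoring network and c_i is the context vector used by the decoder at step i; the LSTM+attention and sentence-embedding models above reuse the same weighted-sum idea over their own hidden states.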
19. 19
● CUDA, cuDNN
○ You need these drivers/libraries installed to utilize the GPU (Nvidia)
● Theano
○ Low level abstraction; you define symbolic variables &amp; functions; python
● TensorFlow
○ Low level abstraction; you define data flow graphs; C++, python
● Torch
○ High abstraction level; very easy C interfacing; Lua
Tools for Hacking
● Caffe
○ Very high level, simple JSON config, little versatility, most useful with convnets (C++/Python to extend)
● High-level wrappers
○ Keras: can bind to either TensorFlow or Theano; python
○ SkFlow: wrapper around TensorFlow for those familiar with Scikit-learn; python
○ Pretty Tensor, TensorFlow Slim: high level wrapper functions for TensorFlow; python
○ DIGITS: supports Caffe and Torch
● More
○ nice overview here