A Tour of Neural Sequence Generators
Sumeet S Singh
Definition
● Problem Types
○ Autoregressive Generation (Conditional / Unconditional)
○ Transduction (Possibly Crossmodal)
○ Modalities / Domains:
■ Speech, Music, Video, Image, Language, Code, Any
■ Variable-Size Outputs and/or Inputs, i.e., Sequences
● Applications
○ Image Captioning and Synthesis
○ Video Captioning
○ Speech and Music Recognition and Synthesis
○ Handwriting Recognition & Synthesis
○ NLP: Machine Translation, Text Summarization, Smart Reply, Smart Compose, Question Answering, etc.
○ Image to HTML
○ … (m)any other ...
Core Ideas
● Model Statistical Distribution of Output Signal
○ Factorize as Product of Conditional Distributions
○ Supervised, Unsupervised, Self-Supervised
● Sequence Learners
○ RNNs (LSTM etc): Multidimensional, Multidirectional
○ CNNs: Masked / Causal, Dilated, Gated ...
○ The distinction between CNNs and RNNs blurs
● Attention
○ Automatic Segmentation, I/O Alignment, Control Flow
○ Memory Selection
○ Pointer Networks
● Multimodal, Crossmodal
○ Raw Data
○ Less domain knowledge required
The output distribution Pr used to be modelled as a continuous distribution (e.g., a mixture of densities whose parameters were predicted by the network), but these days it is usually just a discrete softmax; see the factorization below.
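Written out, the core modelling idea above is the autoregressive factorization of the output sequence x = (x_1, ..., x_T), optionally conditioned on a context c, with the per-step conditional realized either as a mixture density or as a discrete softmax. The parameterization below (with h_t the sequence learner's hidden state) is a generic sketch; the exact form varies from paper to paper:

P(x \mid c) = \prod_{t=1}^{T} P(x_t \mid x_{1:t-1}, c)

P(x_t \mid x_{1:t-1}, c) = \sum_{k=1}^{K} \pi_k(h_t)\, \mathcal{N}\big(x_t;\ \mu_k(h_t), \sigma_k^2(h_t)\big)   (mixture density)

P(x_t = i \mid x_{1:t-1}, c) = \mathrm{softmax}(W h_t + b)_i   (discrete softmax)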
Unconditioned 1D Sequence Generation
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
● RNN / LSTM Based
● Skip Connections
● Models / Generates a 1D
data sequence
● Autoregressive
● Train: Teacher Forcing
● Infer: Sample (autoregressive sampling sketched below)
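A minimal PyTorch-style sketch of the train/infer asymmetry noted above: teacher forcing during training, autoregressive sampling at inference. All names, sizes, and the two-layer LSTM are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.proj(h), state                     # logits over the next symbol

# Training with teacher forcing: the ground-truth prefix is fed in and the
# model predicts every next symbol in parallel.
def teacher_forcing_loss(model, batch):                # batch: (B, T) integer ids
    logits, _ = model(batch[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))

# Inference: sample one symbol at a time and feed it back in.
@torch.no_grad()
def sample(model, start, steps=100):                   # start: (B, 1) initial symbol ids
    x, state, out = start, None, [start]
    for _ in range(steps):
        logits, state = model(x, state)
        x = torch.multinomial(logits[:, -1].softmax(-1), 1)
        out.append(x)
    return torch.cat(out, dim=1)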
RNN Based Unconditioned Handwriting Generator
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
Image Completion (Decoder Only)
Bethge, M., & Theis, L. (2015). Generative Image Modeling Using Spatial LSTMs. NIPS.
Image Completion (Decoder Only)
Oord, Aäron van den et al. “Pixel Recurrent Neural Networks.” ICML (2016)
Notable Mentions
Stollenga, Marijn F. et al. “Parallel Multi-Dimensional LSTM, With Application to
Fast Biomedical Volumetric Image Segmentation.” ArXiv abs/1506.07452 (2015)
Graves, Alex et al. “Multi-dimensional Recurrent Neural Networks.” ICANN (2007).
Kalchbrenner, N., Danihelka, I., & Graves, A. (2015). Grid Long Short-Term Memory. CoRR, abs/1507.01526.
Conditioned Image Generation (Decoder Only)
Oord, Aäron van den et al. “Conditional Image Generation with
PixelCNN Decoders.” NIPS (2016).
Gated Convolution (sketched below)
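“Gated Convolution” above refers to the gated activation units and masked (causal) convolutions of the PixelCNN decoder. A rough sketch of the idea follows; the single-stack mask is a simplification of the paper's horizontal/vertical stack split, and the conditioning path and names are illustrative assumptions.

import torch
import torch.nn as nn

class GatedMaskedConv(nn.Module):
    def __init__(self, channels, kernel_size=3, cond_dim=None):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size, padding=kernel_size // 2)
        # Causal mask over the kernel: only pixels above, or to the left on the
        # same row, are visible (raster-scan order; "mask A" style, centre excluded).
        k = kernel_size
        mask = torch.ones_like(self.conv.weight)
        mask[:, :, k // 2, k // 2:] = 0                # centre pixel and everything to its right
        mask[:, :, k // 2 + 1:, :] = 0                 # all rows below the centre
        self.register_buffer("mask", mask)
        self.cond = nn.Linear(cond_dim, 2 * channels) if cond_dim else None

    def forward(self, x, h=None):                      # h: optional conditioning vector (e.g. class embedding)
        self.conv.weight.data *= self.mask             # enforce causality
        y = self.conv(x)
        if self.cond is not None and h is not None:
            y = y + self.cond(h)[:, :, None, None]     # broadcast conditioning over spatial positions
        a, b = y.chunk(2, dim=1)
        return torch.tanh(a) * torch.sigmoid(b)        # gated activation unit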
Class Conditioned Image Generation
Oord, Aäron van den et al. “Conditional Image Generation with PixelCNN Decoders.” NIPS (2016).
Interpolated Embedding Vectors
Oord, Aäron van den et al. “Conditional Image Generation with PixelCNN Decoders.” NIPS (2016).
Image Generation Conditioned on Embeddings
Oord, Aäron van den et al. “Conditional Image Generation with PixelCNN Decoders.” NIPS (2016).
Video Description (Encoder-Decoder, but with CRF Encoding)
Donahue, Jeff et al. “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description.” IEEE Transactions on Pattern
Analysis and Machine Intelligence 39 (2015): 677-691.
Image Captioning
Donahue, Jeff et al. “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description.” IEEE Transactions on Pattern
Analysis and Machine Intelligence 39 (2015): 677-691.
Dense Captioning (Object Detection + Captioning)
Johnson, Justin et al. “DenseCap: Fully Convolutional Localization Networks for Dense Captioning.” 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)(2016): 4565-4574.
Dense Captioning (CNN, RPN, LSTM)
Johnson, Justin et al. “DenseCap: Fully Convolutional Localization Networks for Dense Captioning.” 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)(2016): 4565-4574.
Text Localization for Image Retrieval
Johnson, Justin et al. “DenseCap: Fully Convolutional Localization Networks for Dense Captioning.” 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR)(2016): 4565-4574.
Speech and Music Synthesis (Dilated, Gated, Causal 1D Convolutions) (Decoder Only)
● Wavenet (Google Assistant)
● Unconditioned Speech &
Music Generation
● Speaker Conditioned
Speech
● Speaker and Text
Conditioned Speech (TTS)
● Speech Recognition
● Takes after PixelCNN (dilated causal convolutions sketched below)
● Raw Audio
● Demo
Oord, Aäron van den et al. “WaveNet: A Generative Model for
Raw Audio.” SSW (2016).
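A rough sketch of the dilated, causal 1D convolutions that give WaveNet its large receptive field over raw audio. The gated activation units and residual/skip connections of the real model are omitted for brevity; names and sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DilatedCausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Left-pad so the convolution never sees future samples.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (B, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class WaveNetLikeStack(nn.Module):
    def __init__(self, channels=64, layers=10):
        super().__init__()
        # Dilations 1, 2, 4, ..., 512: the receptive field grows exponentially with depth.
        self.layers = nn.ModuleList(
            [DilatedCausalConv1d(channels, dilation=2 ** i) for i in range(layers)])

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))               # simple residual; the real model uses gated units + skips
        return x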
Seq2Seq Based Machine Translation
Sutskever, I., Vinyals, O., & Le, Q.V. (2014). Sequence to Sequence Learning with Neural Networks. NIPS.
● Encoder-decoder architecture formalized (concurrently with Cho et al., 2014). LSTM based; a minimal sketch follows.
● “End-to-end approach to sequence learning that makes minimal assumptions on the sequence structure.”
● “Reversing the order of words in source sentences improved performance markedly.”
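A minimal sketch of the encoder-decoder idea under illustrative assumptions (single-layer LSTMs, greedy decoding): the encoder's final state is the fixed-size "sentence embedding" that seeds the decoder.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):                    # teacher-forced training pass
        _, state = self.encoder(self.src_embed(src))   # final state = fixed-size sentence embedding
        out, _ = self.decoder(self.tgt_embed(tgt_in), state)
        return self.proj(out)                          # logits for every target position

    @torch.no_grad()
    def greedy_decode(self, src, bos_id, eos_id, max_len=50):
        _, state = self.encoder(self.src_embed(src))
        tok = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
        out = []
        for _ in range(max_len):
            h, state = self.decoder(self.tgt_embed(tok), state)
            tok = self.proj(h[:, -1]).argmax(-1, keepdim=True)
            out.append(tok)
            if (tok == eos_id).all():
                break
        return torch.cat(out, dim=1)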
Sentence Embeddings of the Encoder
Sutskever, I., Vinyals, O., & Le, Q.V. (2014). Sequence to Sequence Learning with Neural Networks. NIPS.
Encoder-Decoder Based Machine Translation
Cho, Kyunghyun et al. “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.” EMNLP (2014).
● Published at the same time as Sutskever et al. (2014, Sequence to Sequence Learning with Neural Networks), this work also introduced the encoder-decoder architecture.
● Also demonstrated phrase embeddings.
Custom Gated Unit (the GRU)
Parallel Encoder-Decoder (Structured Like the Transformer)
Kalchbrenner, Nal et al. “Neural Machine Translation in Linear Time.” CoRR abs/1610.10099 (2016): n. pag.
Offline Handwritten Line Recognition w/ CTC
Graves, A., & Schmidhuber, J. (2008). Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS.
● Strided Convolution Pyramid with
Multidimensional and Multidirectional
LSTM ‘kernels’
● No Attention
○ Input-output alignment is a problem: handled with CTC (loss usage sketched below)
○ Handcrafted input segmentation (assumes horizontal text)
● Not auto-regressive
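Because there is no attention, the input-output alignment is delegated to CTC. Below is a brief sketch of how a CTC loss is typically wired up, using PyTorch's built-in nn.CTCLoss; the bidirectional LSTM merely stands in for the MDLSTM/convolution pyramid, and all shapes are illustrative.

import torch
import torch.nn as nn

num_classes = 80                                       # character set size + 1 for the CTC blank (index 0)
T, B, F = 120, 4, 64                                   # frames, batch size, features per frame

encoder = nn.LSTM(F, 128, bidirectional=True)          # stand-in for the real feature extractor
classifier = nn.Linear(2 * 128, num_classes)
ctc = nn.CTCLoss(blank=0)

frames = torch.randn(T, B, F)                          # (time, batch, features)
h, _ = encoder(frames)
log_probs = classifier(h).log_softmax(-1)              # (T, B, num_classes) per-frame label posteriors

targets = torch.randint(1, num_classes, (B, 25))       # label ids; 0 is reserved for the blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 25, dtype=torch.long)

# CTC marginalizes over all monotonic alignments between the T frames and the labels.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()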
Offline Handwriting Recognition w/ Line Attention
Bluche, T. (2016). Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition. NIPS.
● MDLSTM + CNN
Encoder
● MDLSTM attention: applied only across rows, i.e., assumes horizontal-ish lines
● Fixed line-width (N) and
fixed # of lines (T).
● CTC used for alignment
● No EOS prediction
● Not auto-regressive
Offline Handwriting Recognition w/ Line Attention
Bluche, T. (2016). Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition. NIPS.
Offline Handwritten Paragraph Recognition w/ CTC
Bluche, Théodore and Ronaldo O. Messina. “Gated Convolutional Recurrent Neural Networks for Multilingual Handwriting
Recognition.” 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) 01 (2017): 646-651.
● Similar architecture to Graves’ 2008 paper, with gated convolutions replacing MDLSTMs
○ CTC for Input-Output
Alignment
○ 1D LSTMs in each direction
● Paragraph Recognition: replace the maxpool with MDLSTM line-attention, as in the 2016 paper.
Online Handwriting Synthesis w/ Attention
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
* He didn’t call it attention back then. He just called it ‘soft window into c’. The input ‘c’ was a one-dimensional sequence.
Online Handwriting Synthesis w/ Attention
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
● Text -> Pen Coordinates
● Can be conditioned by
writer
● Called it a ‘soft window’ instead of ‘attention’ (the window is sketched below)
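The 'soft window' is a mixture of K Gaussians over positions u in the character sequence c = (c_1, ..., c_U), with window parameters predicted by the network at each pen step t (roughly, following the paper):

\phi(t, u) = \sum_{k=1}^{K} \alpha_t^k \exp\big(-\beta_t^k (\kappa_t^k - u)^2\big), \qquad
w_t = \sum_{u=1}^{U} \phi(t, u)\, c_u, \qquad
\kappa_t^k = \kappa_{t-1}^k + \exp(\hat{\kappa}_t^k)

Because the centres \kappa_t^k only move forward, the alignment between the text and the pen trajectory is soft but monotonic.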
Attention Based Machine Translation
Bahdanau, Dzmitry et al. “Neural
Machine Translation by Jointly
Learning to Align and Translate.” CoRR
abs/1409.0473 (2014): n. pag.
Image Caption Generation (CNN, Attention, LSTM)
Xu, Kelvin et al. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” ICML (2015).
Example generated caption: “A giraffe standing in a forest with trees in the background”
Attention Based Math Recognition
● Similar architecture to Image Captioning
● Attention granularity is at the token level (as opposed to line level)
● Much more complex ‘alignment’: multiple levels, not just lines
● CNN-based image encoder
● LSTM-based LaTeX ‘language’ model
● MLP-based attention model (sketched below)
Singh, S.S. (2018). Teaching Machines to Code: Neural Markup Generation with Visual Attention. ArXiv, abs/1802.05415.
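A hedged sketch of MLP-scored ('additive') attention over CNN feature-map locations, the pattern used by this family of models; layer names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_feats, dec_state):
        # enc_feats: (B, L, enc_dim)  -- flattened CNN feature-map locations
        # dec_state: (B, dec_dim)     -- current LSTM decoder state
        scores = self.v(torch.tanh(self.W_enc(enc_feats) + self.W_dec(dec_state).unsqueeze(1)))
        alpha = scores.softmax(dim=1)                  # (B, L, 1): soft alignment over image locations
        context = (alpha * enc_feats).sum(dim=1)       # (B, enc_dim): attended context for the next token
        return context, alpha.squeeze(-1)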
Step by Step Prediction and Alignment
Singh, S.S. (2018). Teaching Machines to Code: Neural Markup Generation with Visual Attention. ArXiv, abs/1802.05415.
Attention Based Machine Translation
Vaswani, Ashish et al. “Attention
is All you Need.” NIPS (2017).
● Parallel paths for each output position
● Soft (scaled dot-product) attention at each layer (sketched below)
● Pointwise feedforward layers do the work; the output embeds control and memory information
● Similar to the WaveNet architecture, except that attention plays the role of the convolution kernel
● Various SOTA language models are transformer based (BERT, GPT-2, etc.)
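For reference, the scaled dot-product attention at the Transformer's core, in a minimal single-head sketch (the multi-head projections are omitted):

import math
import torch

def scaled_dot_product_attention(q, k, v, causal_mask=None):
    # q: (B, T_q, d); k, v: (B, T_k, d)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))     # (B, T_q, T_k)
    if causal_mask is not None:
        scores = scores.masked_fill(causal_mask, float("-inf"))  # block attention to future positions
    return scores.softmax(dim=-1) @ v                            # (B, T_q, d)

# Decoder self-attention uses a causal mask so generation stays autoregressive:
T = 5
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)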
Image Transformer
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image Transformer. ICML.
Image Transformer: Image Completion
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image Transformer. ICML.
Image Transformer: Class-Conditioned Generation
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image Transformer. ICML.
Universal Transformer
● “Combines Parallelizability of Transformers with
recurrent inductive bias of RNNs”
● + Dynamic Halting (simplified sketch below)
● “outperform standard Transformers on a wide
range of algorithmic and language
understanding tasks”
● “... can be shown to be Turing-complete”
● Tasks: bAbI QA, subject-verb agreement, LAMBADA LM (new SOTA), MT (beats the Transformer), algorithmic (copy, reverse, add), Learning to Execute
● Compared to: Neural GPU, Neural Turing
Machine, End-to-End Memory Networks
Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, L. (2019). Universal Transformers. ArXiv, abs/1807.03819.
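A loose, simplified sketch of per-position dynamic halting (ACT-style) as used in the Universal Transformer: the same transition function is applied repeatedly (recurrence in depth), and each position stops updating once its cumulative halting probability crosses a threshold. This omits the remainder-weighted state averaging of full ACT; the transition layer, halting head, and all names are illustrative assumptions.

import torch
import torch.nn as nn

def dynamic_halting(state, transition, halt_head, max_steps=8, threshold=0.99):
    # state: (B, L, d); transition: a shared-weight layer applied at every step
    B, L, _ = state.shape
    cum_halt = torch.zeros(B, L)
    running = torch.ones(B, L)                          # 1.0 where the position is still updating
    for _ in range(max_steps):
        p = torch.sigmoid(halt_head(state)).squeeze(-1) # per-position halting probability
        cum_halt = cum_halt + p * running
        running = running * (cum_halt < threshold).float()
        mask = running.unsqueeze(-1)
        state = mask * transition(state) + (1 - mask) * state   # halted positions keep their state
    return state

# Illustrative wiring: a single shared Transformer encoder layer as the transition function.
d = 64
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
halt_head = nn.Linear(d, 1)
out = dynamic_halting(torch.randn(2, 10, d), layer, halt_head)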
Evolved Transformer
● Evolution Based Architecture Search
● Beats the hand-crafted Transformer on MT. Interesting because the lower layers evolved to use convolutions (regular and separable) and gated linear units; otherwise it is identical to the hand-crafted Transformer (Vaswani et al.).
So, D.R., Liang, C., & Le, Q.V. (2019). The Evolved Transformer. ICML.
Concluding Remarks
● Few Powerful Techniques -> Generic Tensor-to-Tensor ML Framework
● Not Covered
○ Incorporating External Memory / Knowledge into NNs
Thank You!
untrix.github.io/i2l

Editor's Notes

  • #4 E.g., you don’t need to know Arabic to make the best Arabic handwriting recognizer in the world. Hierarchy, depth, size, weight tying, gating, residual and skip connections. Regularization: L1, L2, dropout, maxout, increased data size. Activations, (batch) normalization. Parameter initialization. BPTT. Optimization: SGD algorithms (Adam etc.), RL. Interpretability.
  • #6 Models Mixture Densities.
  • #8 Discrete Softmax Distribution
  • #10 Gated Convolutions, Residual Connections, Blind Spot Removal
  • #14 Encoder-Decoder. CRF = Conditional Random Field, an undirected probabilistic graphical model.
  • #17 CNN encodes the image. The localization layer (RPN) produces a fixed number of regions (B). An LSTM produces a caption for each of the B regions. RPN is used instead of attention (the authors claim the RPN is the same as attention, but that is not true: their RPN is far more complex than an attention mechanism).
  • #20 Raw audio: 16,000 samples per second, 1-dimensional. Takes after PixelCNN, made 1-dimensional; adds dilated convolutions and skip connections. Uses raw audio, not log mel-filterbank energies or mel-frequency cepstral coefficients (MFCCs).
  • #24 Transformer-like structure. Uses CNN residual blocks as well as residual multiplicative (gated) units (borrowed from Video Pixel Networks). Can also wrap RNNs for both encoder and decoder, but still has parallel paths between encoder and decoder.
  • #25 https://distill.pub/2017/ctc/
  • #39 Very interesting twist on Transformer. Claims that although a Transformer is a viable alternative to an RNN, it lacks their inductive bias towards learning iterative or recursive transformations. Therefore the Transformer does not generalize well to input lengths not encountered during training. This architecture aims to fix that by making the transformer stack’s height pointwise dynamic - i.e., the transformer stack height is different at each point. Seems very promising and seminal. Beats original Transformer in MT.