A Tour of Neural Sequence Generators
Sumeet S Singh
Definition
● Problem Types
○ Autoregressive Generation (Conditional / Unconditional)
○ Transduction (Possibly Crossmodal)
○ Modalities / Domains:
■ Speech, Music, Video, Image, Language, Code, Any
■ Variable-Size Outputs and/or Inputs, i.e., Sequences
● Applications
○ Image Captioning and Synthesis
○ Video Captioning
○ Speech and Music Recognition and Synthesis
○ Handwriting Recognition & Synthesis
○ NLP: Machine Translation, Text Summarization, Smart Reply, Smart Compose, Question Answering, etc.
○ Image to HTML
○ … (m)any other ...
Core Ideas
● Model Statistical Distribution of Output Signal
○ Factorize as Product of Conditional Distributions
○ Supervised, Unsupervised, Self-Supervised
● Sequence Learners
○ RNNs (LSTM etc): Multidimensional, Multidirectional
○ CNNs: Masked / Causal, Dilated, Gated ...
○ The distinction between CNNs and RNNs blurs
● Attention
○ Automatic Segmentation, I/O Alignment, Control Flow
○ Memory Selection
○ Pointer Networks
● Multimodal, Crossmodal
○ Raw Data
○ Less domain knowledge required
The output distribution Pr used to be modelled as a continuous distribution (e.g., a mixture of densities whose parameters were predicted by the network), but these days it is usually just a discrete softmax; see the factorization below.
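Written out, the core modelling idea above is the autoregressive factorization of the output sequence x = (x_1, ..., x_T), optionally conditioned on a context c, with the per-step conditional realized either as a mixture density or as a discrete softmax. The parameterization below (with h_t the sequence learner's hidden state) is a generic sketch; the exact form varies from paper to paper:

P(x \mid c) = \prod_{t=1}^{T} P(x_t \mid x_{1:t-1}, c)

P(x_t \mid x_{1:t-1}, c) = \sum_{k=1}^{K} \pi_k(h_t)\, \mathcal{N}\big(x_t;\ \mu_k(h_t), \sigma_k^2(h_t)\big)   (mixture density)

P(x_t = i \mid x_{1:t-1}, c) = \mathrm{softmax}(W h_t + b)_i   (discrete softmax)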
Unconditioned 1D Sequence Generation
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
● RNN / LSTM Based
● Skip Connections
● Models / Generates a 1D
data sequence
● Autoregressive
● Train: Teacher Forcing
● Infer: Sample (autoregressive sampling sketched below)
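A minimal PyTorch-style sketch of the train/infer asymmetry noted above: teacher forcing during training, autoregressive sampling at inference. All names, sizes, and the two-layer LSTM are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.proj(h), state                     # logits over the next symbol

# Training with teacher forcing: the ground-truth prefix is fed in and the
# model predicts every next symbol in parallel.
def teacher_forcing_loss(model, batch):                # batch: (B, T) integer ids
    logits, _ = model(batch[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))

# Inference: sample one symbol at a time and feed it back in.
@torch.no_grad()
def sample(model, start, steps=100):                   # start: (B, 1) initial symbol ids
    x, state, out = start, None, [start]
    for _ in range(steps):
        logits, state = model(x, state)
        x = torch.multinomial(logits[:, -1].softmax(-1), 1)
        out.append(x)
    return torch.cat(out, dim=1)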
RNN Based Unconditioned Handwriting Generator
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
Image Completion (Decoder Only)
Bethge, M., & Theis, L. (2015). Generative Image Modeling Using Spatial LSTMs. NIPS.
Image Completion (Decoder Only)
Oord, Aäron van den et al. “Pixel Recurrent Neural Networks.” ICML (2016)
Notable Mentions
Stollenga, Marijn F. et al. “Parallel Multi-Dimensional LSTM, With Application to
Fast Biomedical Volumetric Image Segmentation.” ArXiv abs/1506.07452 (2015)
Graves, Alex et al. “Multi-dimensional Recurrent Neural Networks.” ICANN (2007).
Kalchbrenner, N., Danihelka, I., & Graves, A. (2015). Grid Long Short-Term Memory. CoRR, abs/1507.01526.
Conditioned Image Generation (Decoder Only)
Oord, Aäron van den et al. “Conditional Image Generation with
PixelCNN Decoders.” NIPS (2016).
Gated Convolution (sketched below)
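“Gated Convolution” above refers to the gated activation units and masked (causal) convolutions of the PixelCNN decoder. A rough sketch of the idea follows; the single-stack mask is a simplification of the paper's horizontal/vertical stack split, and the conditioning path and names are illustrative assumptions.

import torch
import torch.nn as nn

class GatedMaskedConv(nn.Module):
    def __init__(self, channels, kernel_size=3, cond_dim=None):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size, padding=kernel_size // 2)
        # Causal mask over the kernel: only pixels above, or to the left on the
        # same row, are visible (raster-scan order; "mask A" style, centre excluded).
        k = kernel_size
        mask = torch.ones_like(self.conv.weight)
        mask[:, :, k // 2, k // 2:] = 0                # centre pixel and everything to its right
        mask[:, :, k // 2 + 1:, :] = 0                 # all rows below the centre
        self.register_buffer("mask", mask)
        self.cond = nn.Linear(cond_dim, 2 * channels) if cond_dim else None

    def forward(self, x, h=None):                      # h: optional conditioning vector (e.g. class embedding)
        self.conv.weight.data *= self.mask             # enforce causality
        y = self.conv(x)
        if self.cond is not None and h is not None:
            y = y + self.cond(h)[:, :, None, None]     # broadcast conditioning over spatial positions
        a, b = y.chunk(2, dim=1)
        return torch.tanh(a) * torch.sigmoid(b)        # gated activation unit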
Class Conditioned Image Generation
Oord, Aäron van den et al. “Conditional Image Generation with PixelCNN Decoders.” NIPS (2016).
Interpolated Embedding Vectors
Oord, Aäron van den et al. “Conditional Image Generation with PixelCNN Decoders.” NIPS (2016).
Image Generation Conditioned on Embeddings
Oord, Aäron van den et al. “Conditional Image Generation with PixelCNN Decoders.” NIPS (2016).
Video Description (Encoder-Decoder, but with CRF Encoding)
Donahue, Jeff et al. “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description.” IEEE Transactions on Pattern
Analysis and Machine Intelligence 39 (2015): 677-691.
Image Captioning
Donahue, Jeff et al. “Long-Term Recurrent Convolutional Networks for Visual Recognition and Description.” IEEE Transactions on Pattern
Analysis and Machine Intelligence 39 (2015): 677-691.
Dense Captioning (Object Detection + Captioning)
Johnson, Justin et al. “DenseCap: Fully Convolutional Localization Networks for Dense Captioning.” 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)(2016): 4565-4574.
Dense Captioning (CNN, RPN, LSTM)
Johnson, Justin et al. “DenseCap: Fully Convolutional Localization Networks for Dense Captioning.” 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)(2016): 4565-4574.
Text Localization for Image Retrieval
Johnson, Justin et al. “DenseCap: Fully Convolutional Localization Networks for Dense Captioning.” 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR)(2016): 4565-4574.
Speech and Music Synthesis (Dilated, Gated, Causal 1D Convolutions) (Decoder Only)
● Wavenet (Google Assistant)
● Unconditioned Speech &
Music Generation
● Speaker Conditioned
Speech
● Speaker and Text
Conditioned Speech (TTS)
● Speech Recognition
● Takes after PixelCNN (dilated causal convolutions sketched below)
● Raw Audio
● Demo
Oord, Aäron van den et al. “WaveNet: A Generative Model for
Raw Audio.” SSW (2016).
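A rough sketch of the dilated, causal 1D convolutions that give WaveNet its large receptive field over raw audio. The gated activation units and residual/skip connections of the real model are omitted for brevity; names and sizes are illustrative assumptions.

import torch
import torch.nn as nn

class DilatedCausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        # Left-pad so the convolution never sees future samples.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (B, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

class WaveNetLikeStack(nn.Module):
    def __init__(self, channels=64, layers=10):
        super().__init__()
        # Dilations 1, 2, 4, ..., 512: the receptive field grows exponentially with depth.
        self.layers = nn.ModuleList(
            [DilatedCausalConv1d(channels, dilation=2 ** i) for i in range(layers)])

    def forward(self, x):
        for layer in self.layers:
            x = x + torch.relu(layer(x))               # simple residual; the real model uses gated units + skips
        return x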
Seq2Seq Based Machine Translation
Sutskever, I., Vinyals, O., & Le, Q.V. (2014). Sequence to Sequence Learning with Neural Networks. NIPS.
● Encoder-decoder architecture formalized (concurrently with Cho et al., 2014). LSTM based; a minimal sketch follows.
● “End-to-end approach to sequence learning that makes minimal assumptions on the sequence structure.”
● “Reversing the order of words in source sentences improved performance markedly.”
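A minimal sketch of the encoder-decoder idea under illustrative assumptions (single-layer LSTMs, greedy decoding): the encoder's final state is the fixed-size "sentence embedding" that seeds the decoder.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):                    # teacher-forced training pass
        _, state = self.encoder(self.src_embed(src))   # final state = fixed-size sentence embedding
        out, _ = self.decoder(self.tgt_embed(tgt_in), state)
        return self.proj(out)                          # logits for every target position

    @torch.no_grad()
    def greedy_decode(self, src, bos_id, eos_id, max_len=50):
        _, state = self.encoder(self.src_embed(src))
        tok = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
        out = []
        for _ in range(max_len):
            h, state = self.decoder(self.tgt_embed(tok), state)
            tok = self.proj(h[:, -1]).argmax(-1, keepdim=True)
            out.append(tok)
            if (tok == eos_id).all():
                break
        return torch.cat(out, dim=1)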
Sentence Embeddings of the Encoder
Sutskever, I., Vinyals, O., & Le, Q.V. (2014). Sequence to Sequence Learning with Neural Networks. NIPS.
Encoder-Decoder Based Machine Translation
Cho, Kyunghyun et al. “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.” EMNLP (2014).
● Published at the same time as Sutskever et al. (2014, Sequence to Sequence Learning with Neural Networks), this work also introduced the encoder-decoder architecture.
● Also demonstrated phrase embeddings.
Custom Gated Unit (the GRU)
Parallel Encoder-Decoder (Structured Like the Transformer)
Kalchbrenner, Nal et al. “Neural Machine Translation in Linear Time.” CoRR abs/1610.10099 (2016): n. pag.
Offline Handwritten Line Recognition w/ CTC
Graves, A., & Schmidhuber, J. (2008). Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS.
● Strided Convolution Pyramid with
Multidimensional and Multidirectional
LSTM ‘kernels’
● No Attention
○ Input-output alignment is a problem: handled with CTC (loss usage sketched below)
○ Handcrafted input segmentation (assumes horizontal text)
● Not auto-regressive
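Because there is no attention, the input-output alignment is delegated to CTC. Below is a brief sketch of how a CTC loss is typically wired up, using PyTorch's built-in nn.CTCLoss; the bidirectional LSTM merely stands in for the MDLSTM/convolution pyramid, and all shapes are illustrative.

import torch
import torch.nn as nn

num_classes = 80                                       # character set size + 1 for the CTC blank (index 0)
T, B, F = 120, 4, 64                                   # frames, batch size, features per frame

encoder = nn.LSTM(F, 128, bidirectional=True)          # stand-in for the real feature extractor
classifier = nn.Linear(2 * 128, num_classes)
ctc = nn.CTCLoss(blank=0)

frames = torch.randn(T, B, F)                          # (time, batch, features)
h, _ = encoder(frames)
log_probs = classifier(h).log_softmax(-1)              # (T, B, num_classes) per-frame label posteriors

targets = torch.randint(1, num_classes, (B, 25))       # label ids; 0 is reserved for the blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 25, dtype=torch.long)

# CTC marginalizes over all monotonic alignments between the T frames and the labels.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()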
Offline Handwriting Recognition w/ Line Attention
Bluche, T. (2016). Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition. NIPS.
● MDLSTM + CNN
Encoder
● MDLSTM attention: applied only across rows, i.e., assumes horizontal-ish lines
● Fixed line-width (N) and
fixed # of lines (T).
● CTC used for alignment
● No EOS prediction
● Not auto-regressive
Offline Handwriting Recognition w/ Line Attention
Bluche, T. (2016). Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition. NIPS.
Offline Handwritten Paragraph Recognition w/ CTC
Bluche, Théodore and Ronaldo O. Messina. “Gated Convolutional Recurrent Neural Networks for Multilingual Handwriting
Recognition.” 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) 01 (2017): 646-651.
● Similar architecture to Graves’ 2008 paper, with gated convolutions replacing MDLSTMs
○ CTC for Input-Output
Alignment
○ 1D LSTMs in each direction
● Paragraph Recognition: replace the maxpool with MDLSTM line-attention, as in the 2016 paper.
Online Handwriting Synthesis w/ Attention
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
* He didn’t call it attention back then. He just called it ‘soft window into c’. The input ‘c’ was a one-dimensional sequence.
Online Handwriting Synthesis w/ Attention
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. CoRR, abs/1308.0850.
● Text -> Pen Coordinates
● Can be conditioned by
writer
● Called it a ‘soft window’ instead of ‘attention’ (the window is sketched below)
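The 'soft window' is a mixture of K Gaussians over positions u in the character sequence c = (c_1, ..., c_U), with window parameters predicted by the network at each pen step t (roughly, following the paper):

\phi(t, u) = \sum_{k=1}^{K} \alpha_t^k \exp\big(-\beta_t^k (\kappa_t^k - u)^2\big), \qquad
w_t = \sum_{u=1}^{U} \phi(t, u)\, c_u, \qquad
\kappa_t^k = \kappa_{t-1}^k + \exp(\hat{\kappa}_t^k)

Because the centres \kappa_t^k only move forward, the alignment between the text and the pen trajectory is soft but monotonic.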
Attention Based Machine Translation
Bahdanau, Dzmitry et al. “Neural
Machine Translation by Jointly
Learning to Align and Translate.” CoRR
abs/1409.0473 (2014): n. pag.
Image Caption Generation (CNN, Attention, LSTM)
Xu, Kelvin et al. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” ICML (2015).
Example generated caption: “A giraffe standing in a forest with trees in the background”
Attention Based Math Recognition
● Similar architecture to Image Captioning
● Attention granularity is at the token level (as opposed to line level)
● Much more complex ‘alignment’: multiple levels, not just lines
● CNN-based image encoder
● LSTM-based LaTeX ‘language’ model
● MLP-based attention model (sketched below)
Singh, S.S. (2018). Teaching Machines to Code: Neural Markup Generation with Visual Attention. ArXiv, abs/1802.05415.
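A hedged sketch of MLP-scored ('additive') attention over CNN feature-map locations, the pattern used by this family of models; layer names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_feats, dec_state):
        # enc_feats: (B, L, enc_dim)  -- flattened CNN feature-map locations
        # dec_state: (B, dec_dim)     -- current LSTM decoder state
        scores = self.v(torch.tanh(self.W_enc(enc_feats) + self.W_dec(dec_state).unsqueeze(1)))
        alpha = scores.softmax(dim=1)                  # (B, L, 1): soft alignment over image locations
        context = (alpha * enc_feats).sum(dim=1)       # (B, enc_dim): attended context for the next token
        return context, alpha.squeeze(-1)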
Step by Step Prediction and Alignment
Singh, S.S. (2018). Teaching Machines to Code: Neural Markup Generation with Visual Attention. ArXiv, abs/1802.05415.
Attention Based Machine Translation
Vaswani, Ashish et al. “Attention
is All you Need.” NIPS (2017).
● Parallel paths for each output position
● Soft (scaled dot-product) attention at each layer (sketched below)
● Pointwise feedforward layers do the work; the output embeds control and memory information
● Similar to the WaveNet architecture, except that attention plays the role of the convolution kernel
● Various SOTA language models are transformer based (BERT, GPT-2, etc.)
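For reference, the scaled dot-product attention at the Transformer's core, in a minimal single-head sketch (the multi-head projections are omitted):

import math
import torch

def scaled_dot_product_attention(q, k, v, causal_mask=None):
    # q: (B, T_q, d); k, v: (B, T_k, d)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))     # (B, T_q, T_k)
    if causal_mask is not None:
        scores = scores.masked_fill(causal_mask, float("-inf"))  # block attention to future positions
    return scores.softmax(dim=-1) @ v                            # (B, T_q, d)

# Decoder self-attention uses a causal mask so generation stays autoregressive:
T = 5
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)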
Image Transformer
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image Transformer. ICML.
Image Transformer: Image Completion
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image Transformer. ICML.
Image Transformer: Class-Conditioned Generation
Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., & Tran, D. (2018). Image Transformer. ICML.
Universal Transformer
● “Combines Parallelizability of Transformers with
recurrent inductive bias of RNNs”
● + Dynamic Halting (simplified sketch below)
● “outperform standard Transformers on a wide
range of algorithmic and language
understanding tasks”
● “... can be shown to be Turing-complete”
● Tasks: bAbI QA, subject-verb agreement, LAMBADA LM (new SOTA), MT (beats the Transformer), algorithmic (copy, reverse, add), Learning to Execute
● Compared to: Neural GPU, Neural Turing
Machine, End-to-End Memory Networks
Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., & Kaiser, L. (2019). Universal Transformers. ArXiv, abs/1807.03819.
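A loose, simplified sketch of per-position dynamic halting (ACT-style) as used in the Universal Transformer: the same transition function is applied repeatedly (recurrence in depth), and each position stops updating once its cumulative halting probability crosses a threshold. This omits the remainder-weighted state averaging of full ACT; the transition layer, halting head, and all names are illustrative assumptions.

import torch
import torch.nn as nn

def dynamic_halting(state, transition, halt_head, max_steps=8, threshold=0.99):
    # state: (B, L, d); transition: a shared-weight layer applied at every step
    B, L, _ = state.shape
    cum_halt = torch.zeros(B, L)
    running = torch.ones(B, L)                          # 1.0 where the position is still updating
    for _ in range(max_steps):
        p = torch.sigmoid(halt_head(state)).squeeze(-1) # per-position halting probability
        cum_halt = cum_halt + p * running
        running = running * (cum_halt < threshold).float()
        mask = running.unsqueeze(-1)
        state = mask * transition(state) + (1 - mask) * state   # halted positions keep their state
    return state

# Illustrative wiring: a single shared Transformer encoder layer as the transition function.
d = 64
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
halt_head = nn.Linear(d, 1)
out = dynamic_halting(torch.randn(2, 10, d), layer, halt_head)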
Evolved Transformer
● Evolution Based Architecture Search
● Beats the hand-crafted Transformer on MT. Interesting because the lower layers evolved to use convolutions (regular and separable) and gated linear units; otherwise it is identical to the hand-crafted Transformer (Vaswani et al.).
So, D.R., Liang, C., & Le, Q.V. (2019). The Evolved Transformer. ICML.
Concluding Remarks
● Few Powerful Techniques -> Generic Tensor-to-Tensor ML Framework
● Not Covered
○ Incorporating External Memory / Knowledge into NNs
Thank You!
untrix.github.io/i2l

Editor's Notes

  • #4 E.g., you don’t need to know Arabic to make the best Arabic handwriting recognizer in the world. Hierarchy, depth, size, weight tying, gating, residual and skip connections. Regularization: L1, L2, dropout, maxout, increased data size. Activations, (batch) normalization. Parameter initialization. BPTT. Optimization: SGD algorithms (Adam etc.), RL. Interpretability.
  • #6 Models Mixture Densities.
  • #8 Discrete Softmax Distribution
  • #10 Gated Convolutions, Residual Connections, Blind Spot Removal
  • #14 Encoder-Decoder. CRF = Conditional Random Field, an undirected probabilistic graphical model.
  • #17 CNN encodes the image. The localization layer (RPN) produces a fixed number of regions (B). An LSTM produces a caption for each of the B regions. RPN is used instead of attention (the authors claim the RPN is the same as attention, but that is not true: their RPN is far more complex than an attention mechanism).
  • #20 Raw audio: 16,000 samples per second, 1-dimensional. Takes after PixelCNN, made 1-dimensional; adds dilated convolutions and skip connections. Uses raw audio, not log mel-filterbank energies or mel-frequency cepstral coefficients (MFCCs).
  • #24 Transformer-like structure. Uses CNN residual blocks as well as residual multiplicative (gated) units (borrowed from Video Pixel Networks). Can also wrap RNNs for both encoder and decoder, but still has parallel paths between encoder and decoder.
  • #25 https://distill.pub/2017/ctc/
  • #39 Very interesting twist on Transformer. Claims that although a Transformer is a viable alternative to an RNN, it lacks their inductive bias towards learning iterative or recursive transformations. Therefore the Transformer does not generalize well to input lengths not encountered during training. This architecture aims to fix that by making the transformer stack’s height pointwise dynamic - i.e., the transformer stack height is different at each point. Seems very promising and seminal. Beats original Transformer in MT.