How to model long sequences (LSTM)
From: https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
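A minimal NumPy sketch of a single LSTM step, following the standard gate equations the linked post illustrates. The dimensions, sequence length, and random weights are illustrative assumptions, not values from the post.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        """One LSTM time step: compute gates from the current input and
        previous hidden state, then update the cell and hidden states."""
        z = W @ x_t + U @ h_prev + b            # all four gates in one matmul
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
        g = np.tanh(g)                                 # candidate cell update
        c_t = f * c_prev + i * g                       # new cell state
        h_t = o * np.tanh(c_t)                         # new hidden state
        return h_t, c_t

    input_dim, hidden_dim, seq_len = 8, 16, 100        # illustrative sizes
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
    U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
    b = np.zeros(4 * hidden_dim)

    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    for t in range(seq_len):                           # unroll over the (long) sequence
        x_t = rng.normal(size=input_dim)
        h, c = lstm_step(x_t, h, c, W, U, b)
    print(h.shape)                                     # (16,)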
How to model long sequences (CNN)
From: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
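A minimal NumPy sketch of the CNN-for-text idea in the WildML post: convolve filters of several widths over the token-embedding matrix, apply ReLU, and max-pool over time. Vocabulary size, embedding size, and filter counts are illustrative assumptions, not values from the post.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, embed_dim, seq_len = 1000, 32, 20
    num_filters, filter_widths = 4, (3, 4, 5)

    embeddings = rng.normal(size=(vocab_size, embed_dim))
    tokens = rng.integers(0, vocab_size, size=seq_len)
    x = embeddings[tokens]                           # (seq_len, embed_dim)

    pooled = []
    for width in filter_widths:
        filters = rng.normal(size=(num_filters, width, embed_dim))
        # slide each filter over the sequence (valid convolution over time)
        conv = np.stack([
            np.tensordot(x[t:t + width], filters, axes=([0, 1], [1, 2]))
            for t in range(seq_len - width + 1)
        ])                                           # (seq_len - width + 1, num_filters)
        conv = np.maximum(conv, 0.0)                 # ReLU
        pooled.append(conv.max(axis=0))              # max-over-time pooling
    features = np.concatenate(pooled)                # fed to a classifier layer
    print(features.shape)                            # (12,)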
How to model long sequences (CNN)
Convolutional Sequence to Sequence Learning
Neural Machine Translation of Rare Words with Subword Units
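A minimal sketch of the byte-pair-encoding (BPE) merge loop from the subword-units paper: repeatedly merge the most frequent adjacent symbol pair so rare words decompose into known subwords. The toy vocabulary follows the paper's low/lower/newest/widest example; the number of merges is arbitrary.

    import collections
    import re

    def get_pair_counts(vocab):
        """Count adjacent symbol pairs across the (word -> frequency) vocabulary."""
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Merge one symbol pair everywhere it occurs as whole symbols."""
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        merged = "".join(pair)
        return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

    # words stored as space-separated symbols with an end-of-word marker
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):                              # number of merge operations
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print("merged:", best)
    print(list(vocab))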
Google's Neural Machine Translation System
BERT (training tasks)
Masked Language Model: randomly mask words with the [MASK] token and predict the original words
Next Sentence Prediction
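A minimal sketch of the masked-LM corruption step: the BERT paper selects ~15% of tokens, replaces 80% of those with [MASK], 10% with a random token, and leaves 10% unchanged; the model must predict the original token at each selected position. The toy vocabulary below is an illustrative assumption.

    import random

    VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]   # toy vocabulary

    def mask_tokens(tokens, mask_prob=0.15, seed=0):
        """Return (corrupted tokens, prediction targets); None means 'not predicted'."""
        rng = random.Random(seed)
        corrupted, targets = [], []
        for tok in tokens:
            if rng.random() < mask_prob:
                targets.append(tok)                      # model must predict the original
                r = rng.random()
                if r < 0.8:
                    corrupted.append("[MASK]")           # 80%: replace with [MASK]
                elif r < 0.9:
                    corrupted.append(rng.choice(VOCAB))  # 10%: random token
                else:
                    corrupted.append(tok)                # 10%: keep original
            else:
                targets.append(None)
                corrupted.append(tok)
        return corrupted, targets

    print(mask_tokens("the cat sat on the mat".split()))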
BERT
BERT-base: L=12, H=768, A=12, Total Parameters: 110M
Batch size: 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch), for 1M steps.
128,000 tokens/batch * ~467 MFLOPs/token ≈ 60 TFLOPs per batch
Training BERT-base on 4 Cloud TPUs in Pod configuration (16 TPU chips total) took 4 days to complete
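Back-of-the-envelope check of the per-batch figure, assuming the quoted ~467 MFLOPs is per token:

    tokens_per_batch = 256 * 512              # 131,072, i.e. the ~128k tokens/batch above
    flops_per_token = 467e6                   # ~467 MFLOPs/token (figure quoted above)
    flops_per_batch = tokens_per_batch * flops_per_token
    print(f"{flops_per_batch / 1e12:.1f} TFLOPs per batch")   # ~61, i.e. the ~60 TFLOPs above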
Conclusion
Space: 440MB + 393MB = 833MB
Speed: ~173 TFLOPS sustained (total pre-training FLOPs spread over the 4-day run)
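The 173 TFLOPS figure is consistent with dividing the total pre-training compute by the 4-day wall-clock time; a quick check under that assumption (with 4-byte fp32 weights for the 440 MB part of the space figure):

    tokens_per_batch = 128_000                 # ~256 sequences * 512 tokens (as above)
    flops_per_token = 467e6                    # ~467 MFLOPs/token (figure quoted above)
    steps = 1_000_000                          # 1M pre-training steps
    train_seconds = 4 * 24 * 3600              # 4 days of training

    total_flops = tokens_per_batch * flops_per_token * steps
    sustained = total_flops / train_seconds
    print(f"{sustained / 1e12:.0f} TFLOPS sustained")     # ~173 TFLOPS, matching the figure above

    weight_bytes = 110e6 * 4                   # 110M fp32 parameters
    print(f"{weight_bytes / 1e6:.0f} MB of weights")      # ~440 MB, part of the 833 MB above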
From Paper: Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction
Some thoughts
All operations are matrix adds/multiplies (plus a small amount of sin/cos/exp); see the sketch after this list
More hardware-friendly models
Big Op (automatically)
Transformer + NTM
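A minimal NumPy sketch supporting the first point in this list: a Transformer's core computation is matrix multiplies and adds, with only a little exp (the softmax) and sin/cos (the sinusoidal positional encodings). Dimensions and random weights are illustrative assumptions.

    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        """sin/cos positional encodings: the only trigonometry in the model."""
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angles = pos / (10000 ** (2 * i / d_model))
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))   # the only exp
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(x, Wq, Wk, Wv):
        """Scaled dot-product self-attention: three projection matmuls,
        one score matmul, one softmax, one value matmul."""
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        return softmax(scores) @ v

    rng = np.random.default_rng(0)
    seq_len, d_model = 10, 32
    x = rng.normal(size=(seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
    Wq, Wk, Wv = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3)]
    print(self_attention(x, Wq, Wk, Wv).shape)   # (10, 32)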