Transformer & BERT
Models for long sequences
How to model long sequences (LSTM)
From: https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
How to model long sequences (CNN)
From: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
How to model long sequences (CNN)
Convolutional Sequence to Sequence Learning
Neural Machine Translation of Rare Words with Subword Units
Google's Neural Machine Translation System
Seq2seq
From: https://github.com/farizrahman4u/seq2seq
Attention Mechanism
Neural Machine Translation by Jointly Learning to Align and Translate
Transformer (Q, K, V)
From: http://jalammar.github.io/illustrated-transformer/
Why divide by sqrt(d_k)? (sketched below)
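The paper's stated reason: for large d_k the dot products grow large in magnitude and push the softmax into regions with very small gradients, so the scores are scaled by 1/sqrt(d_k). A minimal NumPy sketch of scaled dot-product attention (names and shapes are my own, not from any library):

```python
# Scaled dot-product attention, plus a quick check of why the scaling helps.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k), V: (seq_len, d_v) -> (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # divide by sqrt(d_k)
    weights = softmax(scores, axis=-1)        # one attention distribution per query
    return weights @ V

# For random q, k with unit-variance components, q . k has variance ~d_k,
# so unscaled scores grow with d_k; dividing by sqrt(d_k) keeps variance ~1.
rng = np.random.default_rng(0)
for d_k in (64, 512):
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    scores = (q * k).sum(axis=-1)
    print(d_k, scores.var(), (scores / np.sqrt(d_k)).var())
```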
What about order? (positional encoding, sketched below)
From: http://jalammar.github.io/illustrated-transformer/
From: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
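Self-attention is permutation-invariant, so the Transformer injects order by adding sinusoidal positional encodings to the input embeddings. A small NumPy sketch of the formula from the paper (variable names are my own):

```python
# Sinusoidal positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cos
    return pe

# Added to the token embeddings before the first encoder layer.
pe = positional_encoding(max_len=512, d_model=512)
print(pe.shape)   # (512, 512)
```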
Transformer (parameters)
 Multi-Head-Attention: (512 * 64 * 3 * 8) + (8 * 64 * 512)
 Feed-Forward: (512*2048) + 2048 + (2048 * 512) + 512
 Last-Linear-Layer: (512 * 37000)  (output projection over the shared 37,000-token BPE vocabulary)
 Total: Multi-Head-Attention * 3 (enc self-attn, dec self-attn, dec cross-attn) * 6 layers + Feed-Forward * 2 (enc + dec) * 6 layers + Last-Linear-Layer ≈ 63 * 1e6 (the paper reports 65M for the base model)
((512*64*3*8)+(8*64*512)) * 18 + ((512*2048)+(2048*512)+2048+512) * 12 + 512 * 37000
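A quick sanity check of the count above, assuming Transformer-base hyperparameters (d_model=512, d_k=d_v=64, 8 heads, d_ff=2048, 6 encoder + 6 decoder layers, 37,000-token shared vocabulary); embedding tables, attention biases and layer norms are ignored, as in the slide:

```python
# Rough parameter count for Transformer-base, matching the expression above.
d_model, d_k, heads, d_ff, layers, vocab = 512, 64, 8, 2048, 6, 37000

mha = (d_model * d_k * 3 * heads) + (heads * d_k * d_model)   # Q/K/V + output projections
ffn = (d_model * d_ff) + d_ff + (d_ff * d_model) + d_model    # two linear layers with biases

total = (mha * 3 * layers        # enc self-attn, dec self-attn, dec cross-attn
         + ffn * 2 * layers      # one FFN per encoder layer and per decoder layer
         + d_model * vocab)      # final projection onto the vocabulary

print(mha, ffn, total)           # 1048576 2099712 63014912  (~63M)
```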
Transformer (FLOPs per token)
 Multi-Head-Attention: ((512+511)*64)*3*8+((512+511)*512)
 Feed-Forward: ((512+511)*2048)+2048+((2048+2047)*512)+512
 Last-Linear-Layer: ((512+511)*37000)+37000
 Total: Multi-Head-Attention * 3 * 6 + Feed-Forward * 2 * 6 + Last-Linear-Layer ≈ 126 MFLOPs
(((512+511)*64)*3*8+((512+511)*512))*18+(((512+511)*2048)+2048+((2048+2047)*512)+512)*12+((512+511)*37000)+37000 ≈ 126 * 1e6
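The same check, scripted. It counts only the projection and feed-forward matrix multiplications (each output element costs d_in multiplies plus d_in-1 adds, hence the 512+511 terms) and ignores the QK^T scores, softmax and attention-weighted sum, which depend on sequence length:

```python
# Per-token forward-pass FLOP estimate matching the expression above.
d_model, d_k, heads, d_ff, layers, vocab = 512, 64, 8, 2048, 6, 37000

def matmul_flops(d_in, d_out):
    return (2 * d_in - 1) * d_out            # d_in multiplies + (d_in - 1) adds per output

mha = matmul_flops(d_model, d_k) * 3 * heads + matmul_flops(d_model, d_model)
ffn = matmul_flops(d_model, d_ff) + d_ff + matmul_flops(d_ff, d_model) + d_model
out = matmul_flops(d_model, vocab) + vocab   # output projection + bias adds

total = mha * 3 * layers + ffn * 2 * layers + out
print(mha, ffn, out, total)                  # 2095104 4194304 37888000 125931520 (~126 MFLOPs)
```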
Picture from: https://www.alamy.com/stock-photo-cookie-monster-ernie-elmo-bert-grover-sesame-street-1969-30921023.html
ELMo
BERT
ERNIE
From: https://arxiv.org/pdf/1810.04805.pdf
BERT (Origin)
BERT (embedding)
From: https://arxiv.org/pdf/1810.04805.pdf
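BERT's input representation is the sum of token, segment, and position embeddings (Figure 2 of the paper). A minimal NumPy sketch with BERT-base sizes; the random initialisation and the example ids are only for illustration:

```python
# BERT input embedding: token + segment + position, summed elementwise
# (layer norm and dropout omitted). Sizes match BERT-base.
import numpy as np

vocab_size, max_pos, n_segments, hidden = 30522, 512, 2, 768
rng = np.random.default_rng(0)
tok_emb = rng.standard_normal((vocab_size, hidden)) * 0.02   # illustrative init
seg_emb = rng.standard_normal((n_segments, hidden)) * 0.02
pos_emb = rng.standard_normal((max_pos, hidden)) * 0.02

def embed(token_ids, segment_ids):
    """token_ids, segment_ids: (seq_len,) int arrays -> (seq_len, hidden)."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]

# Arbitrary token/segment ids, just to show the shapes.
x = embed(np.array([101, 7592, 2088, 102]), np.array([0, 0, 0, 0]))
print(x.shape)   # (4, 768)
```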
BERT (training tasks)
 Masked Language Model: randomly mask 15% of the input tokens (mostly with the [MASK] token) and predict the original words (masking sketched after this list)
 Next Sentence Prediction: predict whether sentence B actually follows sentence A
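In the paper, 15% of the WordPiece tokens are selected; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. A rough NumPy sketch (the special-token id and vocabulary size here are placeholders):

```python
# Masked-LM corruption: choose 15% of positions; 80% -> [MASK], 10% -> random
# token, 10% -> unchanged. Labels keep the original ids for the chosen positions.
import numpy as np

MASK_ID, VOCAB_SIZE = 103, 30522                        # placeholder values

def mask_tokens(token_ids, rng, mask_prob=0.15):
    token_ids = token_ids.copy()
    labels = np.full_like(token_ids, -1)                # -1 = not predicted
    selected = rng.random(len(token_ids)) < mask_prob   # ~15% of positions
    labels[selected] = token_ids[selected]              # targets = original ids
    roll = rng.random(len(token_ids))
    token_ids[selected & (roll < 0.8)] = MASK_ID                    # 80% [MASK]
    rand = selected & (roll >= 0.8) & (roll < 0.9)                  # 10% random token
    token_ids[rand] = rng.integers(0, VOCAB_SIZE, rand.sum())
    return token_ids, labels                            # remaining 10% left unchanged

rng = np.random.default_rng(0)
ids = rng.integers(1000, 30000, 20)
corrupted, labels = mask_tokens(ids, rng)
print(corrupted, labels)
```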
BERT
 BERT-base: L=12, H=768, A=12, Total Parameters: 110M
 Batch size: 256 sequences (256 sequences * 512 tokens ≈ 128,000 tokens/batch), for 1M steps. 128,000 tokens * 126 MFLOPs/token ≈ 16 TFLOPs per step (forward pass only)
 Training BERT-base on 4 Cloud TPUs in Pod configuration (16 TPU chips total) took 4 days to complete
 Conclusion
 Space: 440MB + 393MB = 833MB
 Speed: ≈ 47 TFLOP/s sustained (16 TFLOPs/step * 1M steps over 4 days)
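The same back-of-the-envelope numbers, scripted. This reuses the Transformer-base per-token forward-pass estimate from above (BERT-base with H=768 and 12 layers is larger, and backward-pass FLOPs are not counted), so it is only a rough lower bound:

```python
# Rough training-cost estimate for the BERT-base schedule quoted above.
tokens_per_batch = 128_000            # 256 sequences * 512 tokens, rounded as in the slide
flops_per_token = 125_931_520         # ~126 MFLOPs forward pass (estimate above)
steps = 1_000_000
seconds = 4 * 24 * 3600               # 4 days of training

flops_per_step = tokens_per_batch * flops_per_token
sustained = flops_per_step * steps / seconds
print(f"{flops_per_step/1e12:.1f} TFLOPs per step, {sustained/1e12:.1f} TFLOP/s sustained")
# 16.1 TFLOPs per step, 46.6 TFLOP/s sustained
```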
From Paper: Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction
Some thoughts
 All matrix add/multiply operations (plus a small amount of sin/cos/exp)
 A more hardware-friendly model
 One big fused op (generated automatically)
 Transformer + NTM (Neural Turing Machine)

Transformer and BERT