Transformer and BERT


NLP by Deep Learning

  1. Transformer & BERT: models for long sequences
  2. How to model long sequences (LSTM). From: https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
  3. How to model long sequences (CNN). From: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
  4. How to model long sequences (CNN): Convolutional Sequence to Sequence Learning; Neural Machine Translation of Rare Words with Subword Units; Google's Neural Machine Translation System
  5. Seq2seq. From: https://github.com/farizrahman4u/seq2seq
  6. Attention mechanism: Neural Machine Translation by Jointly Learning to Align and Translate
  7. Transformer (Q, K, V). From: http://jalammar.github.io/illustrated-transformer/
  8. Why divide by sqrt(d_k)? From: http://jalammar.github.io/illustrated-transformer/ (a sketch of scaled dot-product attention and the sqrt(d_k) scaling follows this transcript)
  9. What about order? From: http://jalammar.github.io/illustrated-transformer/ and https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf (a positional-encoding sketch follows this transcript)
  10. From: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
  11. Transformer (parameters)
      • Multi-Head-Attention: (512 * 64 * 3 * 8) + (8 * 64 * 512)
      • Feed-Forward: (512 * 2048) + 2048 + (2048 * 512) + 512
      • Last-Linear-Layer: (512 * 37000)
      • Total: Multi-Head-Attention * 3 * 6 + Feed-Forward * 2 * 6 + Last-Linear-Layer
        = ((512*64*3*8) + (8*64*512)) * 18 + ((512*2048) + (2048*512) + 2048 + 512) * 12 + 512 * 37000
        ≈ 63 * 1e6
      (this count is replayed in a short script after this transcript)
  12. Transformer (FLOPS per token)
      • Multi-Head-Attention: ((512 + 511) * 64) * 3 * 8 + ((512 + 511) * 512)
      • Feed-Forward: ((512 + 511) * 2048) + 2048 + ((2048 + 2047) * 512) + 512
      • Last-Linear-Layer: ((512 + 511) * 370000) + 370000
      • Total: Multi-Head-Attention * 3 * 6 + Feed-Forward * 2 * 6 + Last-Linear-Layer
        = (((512+511)*64)*3*8 + ((512+511)*512)) * 18 + (((512+511)*2048) + 2048 + ((2048+2047)*512) + 512) * 12 + ((512+511)*370000) + 370000
        ≈ 467 MFLOPS
      (replayed in a short script after this transcript)
  13. ELMo, BERT, ERNIE. Picture from: https://www.alamy.com/stock-photo-cookie-monster-ernie-elmo-bert-grover-sesame-street-1969-30921023.html
  14. BERT (Origin). From: https://arxiv.org/pdf/1810.04805.pdf
  15. BERT (embedding). From: https://arxiv.org/pdf/1810.04805.pdf (an embedding-sum sketch follows this transcript)
  16. BERT (training tasks)
      • Masked Language Model: replace masked words with the [MASK] token
      • Next Sentence Prediction
      (a masking sketch follows this transcript)
  17. BERT
      • BERT-base: L=12, H=768, A=12, total parameters: 110M
      • Batch size: 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1M steps; 128,000 tokens * 467 MFLOPs/token ≈ 60 TFLOPs per batch
      • Training BERT-base on 4 Cloud TPUs in Pod configuration (16 TPU chips total) took 4 days to complete
      • Conclusion
        • Space: 440MB + 393MB = 833MB
        • Speed: ~173 TFLOPS sustained
      (the throughput arithmetic is replayed after this transcript)
  18. From the paper: Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction
  19. Some thoughts
      • All matrix add/multiply operations (plus a small amount of sin/cos/exp)
      • A more hardware-friendly model
      • Big Op (automatically)
      • Transformer + NTM
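
A minimal NumPy sketch of the scaled dot-product attention from slides 7-8; the shapes and variable names are illustrative assumptions, not code from the slides. It also shows why the scores are divided by sqrt(d_k): dot products of d_k-dimensional vectors have variance roughly d_k, which saturates the softmax unless they are rescaled.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_q, seq_k)
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # (seq_q, d_v)

# Illustrative sizes from the slides: d_model = 512, 8 heads, d_k = d_v = 64.
rng = np.random.default_rng(0)
seq_len, d_k = 10, 64
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (10, 64)

# Why divide by sqrt(d_k): unscaled scores have variance ~d_k, scaled ones ~1.
print(np.var(Q @ K.T), np.var(Q @ K.T / np.sqrt(d_k)))  # ~64 vs ~1
```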
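
Slide 9 asks how the model sees token order. In the Attention Is All You Need paper this is handled by adding sinusoidal positional encodings to the input embeddings; the sketch below assumes that scheme, with sizes chosen purely for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=512, d_model=512)
print(pe.shape)  # (512, 512)
# The encoding is simply added to the token embeddings before the first layer:
# x = token_embeddings + pe[:seq_len]
```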
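
The parameter count on slide 11 can be replayed directly from the base-Transformer sizes it uses (d_model = 512, d_ff = 2048, 8 heads, d_k = d_v = 64, 6 encoder and 6 decoder layers, a ~37,000-word output vocabulary). As on the slide, embeddings and layer norms are ignored.

```python
d_model, d_ff, heads, d_k, vocab = 512, 2048, 8, 64, 37000

# One multi-head attention block: Q/K/V projections per head plus the output projection.
mha = (d_model * d_k * 3 * heads) + (heads * d_k * d_model)

# One position-wise feed-forward block: two linear layers with biases.
ffn = (d_model * d_ff) + d_ff + (d_ff * d_model) + d_model

# Final linear projection onto the vocabulary.
last_linear = d_model * vocab

# 18 attention blocks (1 per encoder layer, 2 per decoder layer, 6 layers each)
# and 12 feed-forward blocks (1 per layer in both stacks).
total = mha * 3 * 6 + ffn * 2 * 6 + last_linear
print(f"{total:,} parameters")  # 63,014,912 -> ~63 * 1e6, as on the slide
```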
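
Slide 12's per-token FLOP estimate follows the same structure: a length-n dot product costs n multiplies plus n - 1 adds, hence the (512 + 511) and (2048 + 2047) factors. The script below replays the slide's own arithmetic, including its 370,000-way output layer.

```python
d_model, d_ff, heads, d_k, vocab = 512, 2048, 8, 64, 370000
mul_add = lambda n: n + (n - 1)  # FLOPs for one length-n dot product

# One multi-head attention block, per token.
mha = mul_add(d_model) * d_k * 3 * heads + mul_add(d_model) * d_model
# One feed-forward block, per token (the + d_ff and + d_model terms are bias adds).
ffn = mul_add(d_model) * d_ff + d_ff + mul_add(d_ff) * d_model + d_model
# Output projection onto the vocabulary, per token.
last_linear = mul_add(d_model) * vocab + vocab

total = mha * 3 * 6 + ffn * 2 * 6 + last_linear
print(f"{total / 1e6:.0f} MFLOPs per token")  # 467, as on the slide
```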
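
Slide 15 shows BERT's input representation: the sum of token, segment, and position embeddings (followed by layer normalization in the real model). A minimal sketch of that sum; the BERT-base sizes (30,522-token vocabulary, hidden size 768, max length 512) are assumptions, and the randomly initialized tables stand in for trained weights.

```python
import numpy as np

vocab_size, hidden, max_len, n_segments = 30522, 768, 512, 2
rng = np.random.default_rng(0)
token_emb = rng.standard_normal((vocab_size, hidden)) * 0.02    # stand-in for trained table
segment_emb = rng.standard_normal((n_segments, hidden)) * 0.02  # sentence A vs. sentence B
position_emb = rng.standard_normal((max_len, hidden)) * 0.02    # learned, not sinusoidal

def bert_embed(token_ids, segment_ids):
    """Input embedding = token + segment + position."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# "[CLS] sentence A [SEP] sentence B [SEP]" -> segment ids 0...0 then 1...1.
token_ids = np.array([101, 2023, 2003, 102, 2009, 2001, 102])  # illustrative ids
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
print(bert_embed(token_ids, segment_ids).shape)  # (7, 768)
```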
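
For slide 16's masked language model task, the BERT paper selects 15% of input positions; of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, and the model must recover the originals. A small sketch of that corruption step (the token ids are illustrative, and 103 is assumed to be the [MASK] id as in the standard uncased BERT vocabulary). Next Sentence Prediction, the second task, is a binary classification and is not sketched here.

```python
import numpy as np

MASK_ID, VOCAB_SIZE = 103, 30522

def mask_tokens(token_ids, rng, mask_prob=0.15):
    """Return corrupted inputs and labels; -100 marks positions that are not predicted."""
    token_ids = np.asarray(token_ids)
    inputs, labels = token_ids.copy(), np.full_like(token_ids, -100)
    chosen = rng.random(len(token_ids)) < mask_prob   # ~15% of positions
    labels[chosen] = token_ids[chosen]                # the model must recover these
    roll = rng.random(len(token_ids))
    inputs[chosen & (roll < 0.8)] = MASK_ID           # 80% -> [MASK]
    rand_pos = chosen & (roll >= 0.8) & (roll < 0.9)  # 10% -> random token
    inputs[rand_pos] = rng.integers(0, VOCAB_SIZE, rand_pos.sum())
    return inputs, labels                             # remaining 10% stay unchanged

rng = np.random.default_rng(0)
inputs, labels = mask_tokens([2023, 2003, 1037, 2742, 6251, 1012], rng)
print(inputs)
print(labels)
```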
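
Slide 17's throughput figures follow from the per-token estimate on slide 12 and the stated training schedule; the arithmetic below simply replays them, taking the 467 MFLOPs/token figure and the 128,000 tokens/batch quoted on the slide as given.

```python
tokens_per_batch = 128_000  # 256 sequences * 512 tokens, as quoted on the slide
flops_per_token = 467e6     # per-token estimate from slide 12
steps = 1_000_000
seconds = 4 * 24 * 3600     # 4 days of training

flops_per_batch = tokens_per_batch * flops_per_token
print(f"{flops_per_batch / 1e12:.1f} TFLOPs per batch")  # ~59.8, i.e. ~60
sustained = flops_per_batch * steps / seconds
print(f"{sustained / 1e12:.0f} TFLOPS sustained")        # ~173
```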
