Transformer and BERT

Big Data Architect at Alibaba Group
Aug. 6, 2020

Transformer and BERT

  1. Transformer & BERT: models for long sequences
  2. How to model long sequences (LSTM) From: https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
  3. How to model long sequences (CNN) From: http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
  4. How to model long sequences (CNN): Convolutional Sequence to Sequence Learning; Neural Machine Translation of Rare Words with Subword Units; Google's Neural Machine Translation System
  5. Seq2seq From: https://github.com/farizrahman4u/seq2seq
  6. Attention Mechanism: Neural Machine Translation by Jointly Learning to Align and Translate
  7. Transformer (Q, K, V) From: http://jalammar.github.io/illustrated-transformer/
  8. From: http://jalammar.github.io/illustrated-transformer/ Why divide by sqrt(d_k)? (see the attention sketch after the slide list)
  9. What about order? From: http://jalammar.github.io/illustrated-transformer/ From: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf (see the positional-encoding sketch after the slide list)
  10. From: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
  11. Transformer (parameters) • Multi-Head-Attention: (512 * 64 * 3 * 8) + (8 * 64 * 512) • Feed-Forward: (512 * 2048) + 2048 + (2048 * 512) + 512 • Last-Linear-Layer: (512 * 37000) • Total: Multi-Head-Attention * 3 * 6 + Feed-Forward * 2 * 6 + Last-Linear-Layer ≈ 63 * 1e6, i.e. ((512*64*3*8)+(8*64*512)) * 18 + ((512*2048)+(2048*512)+2048+512) * 12 + 512 * 37000 (worked out in the parameter-count sketch after the slide list)
  12. Transformer (FLOPS per token) • Multi-Head-Attention: ((512+511)*64)*3*8 + ((512+511)*512) • Feed-Forward: ((512+511)*2048) + 2048 + ((2048+2047)*512) + 512 • Last-Linear-Layer: ((512+511)*370000) + 370000 • Total: Multi-Head-Attention * 3 * 6 + Feed-Forward * 2 * 6 + Last-Linear-Layer ≈ 467 MFLOPs, i.e. (((512+511)*64)*3*8+((512+511)*512))*18 + (((512+511)*2048)+2048+((2048+2047)*512)+512)*12 + ((512+511)*370000)+370000 (reproduced in the FLOPs sketch after the slide list)
  13. ELMo, BERT, ERNIE. Picture from: https://www.alamy.com/stock-photo-cookie-monster-ernie-elmo-bert-grover-sesame-street-1969-30921023.html
  14. From: https://arxiv.org/pdf/1810.04805.pdf BERT (Origin)
  15. BERT (embedding) From: https://arxiv.org/pdf/1810.04805.pdf
  16. BERT (training tasks) • Masked Language Model: randomly mask input tokens with the [MASK] token and predict the originals • Next Sentence Prediction: predict whether sentence B actually follows sentence A (see the masking sketch after the slide list)
  17. BERT • BERT-base: L=12, H=768, A=12, total parameters: 110M • Batch size: 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch), for 1M steps; 128,000 * 467 MFLOPs ≈ 60 TFLOPs per batch • Training BERT-base on 4 Cloud TPUs in Pod configuration (16 TPU chips total) took 4 days • Conclusion: Space: 440MB + 393MB = 833MB; Speed: ~173 TFLOP/s sustained (arithmetic reproduced after the slide list)
  18. From Paper: Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction
  19. Some thoughts • All operations are matrix adds/multiplies (plus a small amount of sin/cos/exp) • A more hardware-friendly model • Big ops (automatically) • Transformer + NTM
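
A few sketches referenced from the slides above. First, the scaled dot-product attention behind the Q, K, V and sqrt(d_k) slides: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Dividing by sqrt(d_k) keeps the dot products from growing with the key dimension, which would otherwise push the softmax into a saturated region with tiny gradients. This is a minimal NumPy sketch; the shapes and random inputs are illustrative, not taken from the talk.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_q, seq_k); scaling keeps score variance ~1
    scores -= scores.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key positions
    return weights @ V                               # weighted sum of the values

# Illustrative shapes: 5 query positions, 7 key/value positions, d_k = d_v = 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))
K = rng.normal(size=(7, 64))
V = rng.normal(size=(7, 64))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 64)
```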
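
For the "What about order?" slide: self-attention by itself is permutation-invariant, so the Transformer adds sinusoidal positional encodings to the token embeddings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A small sketch with the base-model sizes (max length 512, d_model = 512) used purely for illustration:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2), the even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even indices get sine
    pe[:, 1::2] = np.cos(angles)                 # odd indices get cosine
    return pe

pe = positional_encoding(max_len=512, d_model=512)
print(pe.shape)  # (512, 512); added element-wise to the token embeddings
```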
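
The parameter-count sketch for the "Transformer (parameters)" slide. It reproduces the slide's arithmetic for the base model (d_model = 512, d_k = d_v = 64, 8 heads, d_ff = 2048, 6 encoder + 6 decoder layers) with a 37,000-token output vocabulary, which is consistent with the stated ~63M total; the decoder has two attention blocks per layer, hence 18 attention blocks and 12 feed-forward blocks.

```python
d_model, d_k, heads, d_ff, vocab = 512, 64, 8, 2048, 37_000

# One multi-head attention block: Q/K/V projections for every head plus the output projection
mha = (d_model * d_k * 3 * heads) + (heads * d_k * d_model)

# One position-wise feed-forward block: two linear layers with biases
ffn = (d_model * d_ff) + d_ff + (d_ff * d_model) + d_model

# Final projection onto the output vocabulary
last_linear = d_model * vocab

# 6 encoder layers (1 attention block each) + 6 decoder layers (2 each) = 18 attention blocks,
# and 12 feed-forward blocks in total
total = mha * 18 + ffn * 12 + last_linear
print(f"{total / 1e6:.1f}M parameters")  # ~63.0M
```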
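
The FLOPs sketch for the "Transformer (FLOPS per token)" slide, reproduced with the slide's own figures: a length-n dot product is counted as n multiplies plus n-1 adds (hence the 512 + 511 pattern), and the output layer uses the 370,000-entry vocabulary that appears on that slide.

```python
d_model, d_k, heads, d_ff = 512, 64, 8, 2048
vocab = 370_000                       # vocabulary size used on the FLOPS slide
mac = d_model + d_model - 1           # 512 multiplies + 511 adds per length-512 dot product

# One multi-head attention block: Q/K/V projections for 8 heads plus the output projection
mha = (mac * d_k) * 3 * heads + (mac * d_model)

# One feed-forward block: two linear layers plus their bias additions
ffn = (mac * d_ff) + d_ff + ((d_ff + d_ff - 1) * d_model) + d_model

# Final projection onto the vocabulary, plus its bias additions
last_linear = mac * vocab + vocab

total = mha * 18 + ffn * 12 + last_linear
print(f"{total / 1e6:.0f} MFLOPs per token")  # ~467
```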
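
The masking sketch for the "BERT (training tasks)" slide. The 15% selection rate and the 80/10/10 split ([MASK] / random token / unchanged) follow the BERT paper; the mask_tokens helper and the word-level tokens are an illustrative simplification (real implementations operate on word-piece IDs).

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None where no prediction is required."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                          # the model must predict the original token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_TOKEN)            # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                   # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(None)                         # no loss on untouched positions
    return corrupted, labels

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, vocab=tokens))
```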
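
Finally, the training-cost arithmetic from the BERT-base slide, reproduced with the slide's own figures (467 MFLOPs per token from the earlier estimate, 256 sequences of 512 tokens per batch, 1M steps, 4 days of training).

```python
flops_per_token = 467e6          # per-token estimate from the FLOPS slide
tokens_per_batch = 256 * 512     # 128,000 tokens per batch
steps = 1_000_000
train_seconds = 4 * 24 * 3600    # 4 days

flops_per_batch = tokens_per_batch * flops_per_token
throughput = flops_per_batch * steps / train_seconds
param_bytes = 110e6 * 4          # 110M float32 parameters ≈ 440 MB

print(f"{flops_per_batch / 1e12:.0f} TFLOPs per batch")   # ~60
print(f"{throughput / 1e12:.0f} TFLOP/s sustained")       # ~173
print(f"{param_bytes / 1e6:.0f} MB of parameters")        # ~440
```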