BERT: Bidirectional Encoder Representations from Transformers


BERT was developed by Google AI Language and released in October 2018. At release it achieved state-of-the-art results on many NLP tasks, so if you are interested in NLP, studying BERT is a good place to start.


Liangqun Lu
MS in CS and PhD in Biology
2019-02-25
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv:1810.04805 [cs.CL].
http://arxiv.org/abs/1810.04805.

Related Previous Work
● Attention: Neural Machine Translation by Jointly Learning to Align and
Translate (Bahdanau et al. 2014)
● Transformer: Attention Is All You Need (Vaswani et al. 2017)
● ELMo: Deep Contextualized Word Representations (Peters et al. 2018)
● GPT: Improving Language Understanding by Generative Pre-Training (Radford
et al. 2018)
Figure: models leading up to BERT (Seq2seq, NMT, Attention, Transformer; Word2Vec, GloVe, ELMo, GPT)

Sequence-to-sequence neural network
● Many NLP tasks can be phrased as sequence-to-sequence:
○ Language translation (input → output)
○ Summarization (long text → short text)
○ Dialogue (previous utterances → next utterance)
○ Parsing (input text → output parse as sequence)
○ Code generation (natural language → Python code)
Figure: Input → Encoder → Decoder → Output

NMT: Neural machine translation
● Two RNN models are involved: an encoder and a decoder
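As a rough illustration of the two-RNN setup, the sketch below runs a vanilla RNN encoder over a source sequence and hands its final hidden state to a decoder RNN that emits tokens greedily. The weights are random and untrained, and every name and size here (`rnn_step`, `encode`, `decode`, `VOCAB`, `EMB`, `HID`) is an illustrative assumption rather than anything from the slides; real NMT systems typically use LSTM or GRU cells and learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMB, HID = 12, 8, 16  # hypothetical toy sizes

# Randomly initialised parameters (untrained; for shape illustration only).
E_src = rng.normal(0, 0.1, (VOCAB, EMB))   # source embeddings
E_tgt = rng.normal(0, 0.1, (VOCAB, EMB))   # target embeddings
enc_Wx, enc_Wh = rng.normal(0, 0.1, (EMB, HID)), rng.normal(0, 0.1, (HID, HID))
dec_Wx, dec_Wh = rng.normal(0, 0.1, (EMB, HID)), rng.normal(0, 0.1, (HID, HID))
W_out = rng.normal(0, 0.1, (HID, VOCAB))   # hidden state -> vocabulary logits


def rnn_step(x, h, Wx, Wh):
    """One vanilla RNN step: new hidden state from input x and previous state h."""
    return np.tanh(x @ Wx + h @ Wh)


def encode(src_ids):
    """Run the encoder RNN and return its final hidden state."""
    h = np.zeros(HID)
    for tok in src_ids:
        h = rnn_step(E_src[tok], h, enc_Wx, enc_Wh)
    return h


def decode(h, bos_id=0, max_len=5):
    """Greedy decoding: start from the encoder state and emit max_len token ids."""
    out, tok = [], bos_id
    for _ in range(max_len):
        h = rnn_step(E_tgt[tok], h, dec_Wx, dec_Wh)
        tok = int(np.argmax(h @ W_out))
        out.append(tok)
    return out


context = encode([3, 7, 1, 4])      # fixed-size vector, whatever the source length
translation = decode(context)       # list of target token ids
```

Note that `context` has the same size no matter how long the source sequence is, which is the single vector the decoder has to work from in this plain encoder-decoder setup.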

NMT training

Pros and cons of NMT
● Pros:
○ Better performance than previous statistical machine translation
○ Requires much less human engineering effort
○ A single neural network to be optimized end-to-end
● Cons:
○ Less interpretable
○ Difficult to control (can’t easily specify rules or guidelines for translation)
○ Information bottleneck: the encoder must compress the whole source sentence into a single fixed-size vector

Attention provides a solution to the bottleneck problem: at each step, the
decoder focuses on a particular part of the source sequence

Attention is great!
● Attention significantly improves NMT performance
● Attention helps with the vanishing gradient problem
● Attention provides some interpretability
○ By inspecting the attention distribution, we can see the alignment between
words, which shows that the neural network learns the alignment
Attention is a way to focus on particular parts of the input; it improves
sequence-to-sequence models a lot

Attention is a general Deep Learning technique
● More general definition of attention:
● Given a set of vector values, and a vector query, attention is a
technique to compute a weighted sum of the values, dependent on the
query.
● For example, in the seq2seq + attention model, each decoder hidden state
attends to the encoder hidden states.

● Intuition:
● The weighted sum is a selective summary of the information
contained in the values, where the query determines which values to
focus on.
● Attention is a way to obtain a fixed-size representation of an arbitrary
set of representations (the values), dependent on some other
representation (the query).
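The general definition above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the query is scored against each value with a plain dot product (Bahdanau et al. use a small additive network instead), and the function name `attention` and the toy shapes are hypothetical.

```python
import numpy as np


def attention(query, values):
    """Dot-product attention: a weighted sum of `values`, weighted by similarity to `query`.

    query:  shape (d,)
    values: shape (n, d), for any n >= 1
    returns (output of shape (d,), weights of shape (n,))
    """
    scores = values @ query                  # one score per value
    scores = scores - scores.max()           # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    output = weights @ values                # weighted sum of the values
    return output, weights


rng = np.random.default_rng(0)
decoder_state = rng.normal(size=4)           # the query
short_source = rng.normal(size=(3, 4))       # 3 encoder hidden states
long_source = rng.normal(size=(10, 4))       # 10 encoder hidden states

out_short, w_short = attention(decoder_state, short_source)
out_long, w_long = attention(decoder_state, long_source)
```

Both outputs have shape `(4,)` even though the two sources differ in length, which is the "fixed-size representation of an arbitrary set" point made above; the weights form a probability distribution over the values.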

Transformer Overview
● Sequence-to-sequence model: encoder to decoder
● Task: machine translation with a parallel corpus
● Predict each translated word
● The final cost/error function is the standard cross-entropy error on top of
a softmax classifier
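The loss in the last bullet can be made concrete with a small example. The sketch below computes the cross-entropy of a softmax over hypothetical logits for one predicted word; the function names and the 4-word vocabulary are assumptions for illustration, and a real model would average this loss over all target positions in a batch.

```python
import numpy as np


def softmax(logits):
    """Convert a vector of logits into a probability distribution."""
    shifted = logits - np.max(logits)        # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(logits, target_id):
    """Negative log-probability the softmax assigns to the correct word."""
    probs = softmax(logits)
    return -np.log(probs[target_id])


# Hypothetical logits over a 4-word target vocabulary at one decoding step.
logits = np.array([2.0, 0.5, -1.0, 0.1])
loss_correct = cross_entropy(logits, target_id=0)   # the model favours word 0
loss_wrong = cross_entropy(logits, target_id=2)     # the model disfavours word 2
```

The loss is small when the model puts high probability on the reference word and large otherwise, so minimizing it pushes the predicted distribution toward the reference translation.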

Scaled Dot-Product Attention
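For reference, the Transformer's core operation is scaled dot-product attention, which Vaswani et al. (2017) define as follows. Here $Q$, $K$ and $V$ are the matrices of queries, keys and values, and $d_k$ is the dimension of the keys; the formula is quoted from that paper, not reconstructed from the slides.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.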

BERT outline
● Contextual word representations
● Masked language model
● Next sentence prediction
● Model architecture
● Experiments
a. Sentence Pair Classification [MNLI]
b. Single Sentence Classification [SST-2]
c. Question Answering [SQuAD]
d. Single Sentence Tagging [CoNLL-NER]
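To make the masked language model item concrete, here is a rough sketch of the masking rule described in the BERT paper: about 15% of token positions are selected for prediction, and a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The function `mask_tokens`, the toy vocabulary, and the per-position sampling are simplifying assumptions; the reference implementation also skips special tokens such as `[CLS]` and `[SEP]` and caps the number of predictions per sequence.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking; return (corrupted tokens, {position: original token})."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # position not selected for prediction
        targets[i] = tok                      # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets


tokens = "my dog is hairy and he likes to play in the park".split()
vocab = sorted(set(tokens))                   # toy vocabulary (assumption)
corrupted, targets = mask_tokens(tokens, vocab, seed=0)
```

During pre-training the model sees `corrupted` and is trained to predict the original tokens stored in `targets`, using context from both the left and the right of each selected position.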

SQuAD -- Stanford Question Answering Dataset

SQuAD1.1 Leaderboard

Conclusion
● BERT is a strong pre-trained language model that uses a bidirectional
Transformer
● BERT can be fine-tuned to achieve good performance on many NLP tasks
● The source code is available on GitHub

References
● Stanford CS224n: Natural Language Processing with Deep Learning
● Stanford CS231n: Convolutional Neural Networks for Visual Recognition
● http://people.ee.duke.edu/~lcarin/Kevin8.3.2018.pdf
● https://zhuanlan.zhihu.com/p/52282552
● https://zhuanlan.zhihu.com/p/46178084
● https://zhuanlan.zhihu.com/p/39034683
