BERT: Bidirectional Encoder Representations from Transformers

BERT was developed by Google AI Language and released in October 2018. At publication it reported state-of-the-art results on eleven NLP tasks, which makes it a good starting point for anyone interested in NLP.

Liangqun Lu
MS in CS and PhD in Biology
2019-02-25
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT:
Pre-Training of Deep Bidirectional Transformers for Language Understanding.”
arXiv:1810.04805 [cs.CL]. http://arxiv.org/abs/1810.04805

Related Previous Work
● Attention: Neural Machine Translation by Jointly Learning to Align and
Translate (Bahdanau et al. 2014)
● Transformer: Attention Is All You Need (Vaswani et al. 2017)
● ELMo: Deep Contextualized Word Representations (Peters et al. 2018)
● GPT: Improving Language Understanding by Generative Pre-Training (Radford
et al. 2018)
[Figure: prior work leading to BERT. Labels: Seq2seq, NMT, Attention, Transformer; Word2Vec, GloVe, ELMo, GPT]

Sequence to sequence neural network
● Many NLP tasks can be phrased as sequence-to-sequence:
○ Language translation (input → output)
○ Summarization (long text → short text)
○ Dialogue (previous utterances → next utterance)
○ Parsing (input text → output parse as sequence)
○ Code generation (natural language → Python code)
[Figure: Input → Encoder → Decoder → Output]

NMT: Neural machine translation
● Two RNNs are involved: an encoder and a decoder

NMT training

Pros and cons of NMT
● Pros:
○ Better performance than earlier statistical machine translation
○ Requires much less human engineering effort
○ A single neural network to be optimized end-to-end
● Cons:
○ Less interpretable
○ Difficult to control (can’t easily specify rules or guidelines for translation)
○ Information bottleneck
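
The information bottleneck can be seen in a toy encoder-decoder. The sketch below is only an illustration: the scalar RNN, its made-up weights, and the names `rnn_step`, `encode`, and `decode` are assumptions, not part of any real NMT system. Whatever the source length, the decoder receives a single state from the encoder.

```python
import math

def rnn_step(state, x, w_state=0.5, w_input=1.0):
    """One step of a scalar toy RNN (weights are made up, not trained)."""
    return math.tanh(w_state * state + w_input * x)

def encode(source):
    """Encoder RNN: read the whole source, return only the final state."""
    state = 0.0
    for x in source:
        state = rnn_step(state, x)
    return state

def decode(state, steps):
    """Decoder RNN: start from the encoder's final state, feed back each output."""
    outputs = []
    prev = 0.0  # stands in for a start-of-sequence input
    for _ in range(steps):
        state = rnn_step(state, prev)
        prev = state  # toy choice: the output is the state itself
        outputs.append(prev)
    return outputs

# Two hypothetical "sentences" of different lengths, encoded as numbers.
short_summary = encode([0.2, -0.1])
long_summary = encode([0.2, -0.1, 0.7, 0.3, -0.5, 0.9])

# The decoder sees one number in both cases: that is the bottleneck.
translation = decode(long_summary, steps=3)
```

A longer source has to be squeezed into the same single state, which is the limitation attention was introduced to relieve.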

Attention provides a solution to the bottleneck problem: at each step of the
decoder, focus on a particular part of the source sequence

Attention is great!
● Attention significantly improves NMT performance
● Attention helps with the vanishing gradient problem
● Attention provides some interpretability
○ By inspecting the attention distribution, we can see the alignment between
words, which shows that the neural network learns the alignment
Attention is a way to focus on particular parts of the input; it improves
sequence-to-sequence models a lot

Attention is a general Deep Learning technique
● More general definition of attention:
● Given a set of vector values, and a vector query, attention is a
technique to compute a weighted sum of the values, dependent on the
query.
● For example, in the seq2seq + attention model, each decoder hidden state
attends to the encoder hidden states.
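
One possible way to write the definition above as formulas, assuming dot-product scoring and softmax normalization (the definition itself fixes neither, and the symbols q, v_i, e_i, α_i, and a are introduced here only for illustration):

```latex
% Values v_1, ..., v_N and query q, assumed to share the same dimension d
e_i = q^\top v_i, \qquad
\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{N} \exp(e_j)}, \qquad
a = \sum_{i=1}^{N} \alpha_i \, v_i
```

In the seq2seq + attention case, q would be a decoder hidden state and the v_i would be the encoder hidden states.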

● Intuition:
● The weighted sum is a selective summary of the information
contained in the values, where the query determines which values to
focus on.
● Attention is a way to obtain a fixed-size representation of an arbitrary
set of representations (the values), dependent on some other
representation (the query).
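
Both points of the intuition can be checked with a minimal sketch. It assumes dot-product scores and a softmax, and the names `softmax` and `attend` and the toy vectors are hypothetical: the summary keeps the size of a single value however many values are given, and changing the query changes which value dominates.

```python
import math

def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, values):
    """Return (summary, weights): a weighted sum of `values` driven by `query`."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in values]
    weights = softmax(scores)
    dim = len(values[0])
    summary = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return summary, weights

# Hypothetical toy values: each is a 2-dimensional vector.
short_set = [[1.0, 0.0], [0.0, 1.0]]
long_set = short_set + [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

# The summary has the dimension of one value, however many values there are.
summary_short, _ = attend([4.0, 0.0], short_set)
summary_long, _ = attend([4.0, 0.0], long_set)

# Changing the query changes which value dominates the summary.
_, w_first = attend([4.0, 0.0], short_set)   # favours the first value
_, w_second = attend([0.0, 4.0], short_set)  # favours the second value
```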

Transformer Overview
● Sequence-to-sequence: encoder to decoder
● Task: machine translation with a parallel corpus
● Predict each translated word
● The final cost/error function is the standard cross-entropy error on top of
a softmax
classifier
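
The loss in the last bullet can be sketched in a few lines. This is a simplified illustration, not the Transformer's training code: the four-word vocabulary, the logits, and the function names are made up, and the loss is simply averaged over target positions.

```python
import math

def softmax(logits):
    """Convert a vector of logits into a probability distribution."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability the softmax assigns to the correct word."""
    return -math.log(softmax(logits)[target_index])

def sequence_loss(step_logits, target_indices):
    """Average the per-position cross-entropy over the target sentence."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, target_indices)]
    return sum(losses) / len(losses)

# Hypothetical vocabulary and decoder outputs: one row of logits per position.
vocab = ["<eos>", "le", "chat", "dort"]
step_logits = [
    [0.1, 2.0, 0.3, 0.0],  # position 0: the model favours "le"
    [0.0, 0.2, 3.0, 0.1],  # position 1: the model favours "chat"
]
targets = [1, 2]  # reference translation: "le", "chat"

loss = sequence_loss(step_logits, targets)
```

The loss is small when the highest logit at each position matches the reference word and grows as probability mass moves to other words.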

Scaled Dot-Product Attention

BERT outline
● Contextual word representations
● Masked language model
● Next sentence prediction
● Model architecture
● Experiments
a. Sentence Pair Classification [MNLI]
b. Single Sentence Classification [SST-2]
c. Question Answering [SQuAD]
d. Single Sentence Tagging [CoNLL-NER]
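
As a rough sketch of the masked language model item in the outline: the BERT paper selects about 15% of input tokens for prediction and replaces a selected token with [MASK] 80% of the time, with a random token 10% of the time, and leaves it unchanged 10% of the time. The function name `mask_tokens`, the toy vocabulary, and the independent per-token coin flip are assumptions for illustration, not the released implementation.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modelling.

    labels[i] is the original token where a prediction is required,
    and None where the position is not part of the training loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(token)              # 10%: keep unchanged
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels

# Toy sentence and vocabulary (hypothetical).
tokens = ["my", "dog", "is", "hairy"]
vocab = ["my", "dog", "is", "hairy", "apple", "runs"]
inputs, labels = mask_tokens(tokens, vocab)
```

Because the model cannot tell which positions were altered, it has to keep a contextual representation of every input token, using context from both the left and the right.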

SQuAD: Stanford Question Answering Dataset

SQuAD1.1 Leaderboard

Conclusion
● BERT is a strong pre-trained language model built on a bidirectional
Transformer encoder
● BERT can be fine-tuned to achieve good performance on many NLP tasks
● The source code is available on GitHub

References
● Stanford CS224n: Natural Language Processing with Deep Learning
● Stanford CS231n: Convolutional Neural Networks for Visual Recognition
● http://people.ee.duke.edu/~lcarin/Kevin8.3.2018.pdf
● https://zhuanlan.zhihu.com/p/52282552
● https://zhuanlan.zhihu.com/p/46178084
● https://zhuanlan.zhihu.com/p/39034683
