BERT
Bidirectional Encoder Representations from Transformers
Jeangoo Yoon
Hanwha Systems / ICT
Pre-training in NLP
The biggest challenge in NLP is the shortage of training data:
most tasks have only a few hundred to a few thousand human-labeled training examples.
Characteristic of deep-learning-based NLP models:
performance improves when they are trained on
millions to billions of annotated training samples.
Approach: use the large amount of unannotated text on the web
for general-purpose pre-training.
Pre-trained representation model:
on small-data NLP tasks such as question answering and sentiment analysis,
it yields a larger accuracy gain than a model
trained from scratch.
ELMo* (feature-based)
OpenAI GPT* (fine-tuning)
* ELMo: Embeddings from Language Models; GPT: Generative Pre-trained Transformer
What makes BERT different?
Pre-trained representations
– context-free: word2vec, GloVe
– contextual, unidirectional: OpenAI GPT
– contextual, bidirectional: ELMo (shallow), BERT (deep)
BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus
Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo
uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three,
only BERT representations are jointly conditioned on both left and right context in all layers.
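To make the unidirectional vs. bidirectional distinction concrete, here is a minimal sketch (not from the slides; NumPy and a toy 4-token sequence are assumed) contrasting the left-to-right attention mask of a GPT-style Transformer with the full attention BERT applies in every layer.

```python
import numpy as np

seq_len = 4

# GPT-style left-to-right (causal) mask: position i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len)))

# BERT-style bidirectional attention: every position attends to every other position.
full_mask = np.ones((seq_len, seq_len))

print(causal_mask)  # lower-triangular: left context only
print(full_mask)    # all ones: left and right context in every layer
```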
NLP in 2018
• Word embedding
– word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fastText (Bojanowski et al., 2016)
• Language model
– ULMFiT(Howard et al., 18 Jan 2018) - Universal language model fine-tuning for text classification
– ELMo (Peters et al., 15 Feb 2018) - Deep contextualized word representations
• bidirectional LSTMs, language-model based
• Attention model
– Transformer(Vaswani et al., 12 Jun 2017) - Attention is all you need
• attention + feed-forward network, replaces RNNs
• Attention: in the mapping from seq A (a1, a2, a3) to seq B (b1, b2, b3), the decoder does not look b1 up in encoder A directly but forms it
as a weighted sum, i.e. b1 = w1*a1 + w2*a2 + w3*a3 (see the sketch after this list)
– OpenAI GPT(Radford et al., 11 Jun 2018) - Improving language understanding by generative pre-training
• unidirectional, attention, transformer
– BERT(Devlin et al., 11 Oct 2018) - BERT: Pre-training of deep bidirectional transformers for language
understanding
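As a small illustration of the weighted-sum view of attention in the bullet above, the following NumPy sketch computes b1 as a softmax-weighted combination of the encoder states a1..a3; the function name and dimensions are illustrative assumptions, not from the slides.

```python
import numpy as np

def attention(query, keys, values):
    # similarity of the decoder query to each encoder position a1..a3
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> w1, w2, w3
    return weights @ values                          # b1 = w1*a1 + w2*a2 + w3*a3

a = np.random.randn(3, 8)   # encoder sequence A = (a1, a2, a3), hidden dim 8
q = np.random.randn(8)      # decoder query for output step b1
b1 = attention(q, keys=a, values=a)
print(b1.shape)             # (8,)
```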
BERT (1/4)
• Input Representation
– WordPiece embeddings with a 30,000-token vocabulary; sub-word pieces are marked with ## (see the tokenizer sketch below)
– positional embeddings for sequences of up to 512 tokens
• Pre-training time
– BERTbase: 4 Cloud TPUs (16 TPU chips), 4 days
– BERTlarge: 16 Cloud TPUs (64 TPU chips), 4 days
• Model Architecture (L: number of layers (Transformer blocks), H: hidden size, A: number of self-attention heads)
– BERTbase: L=12, H=768, A=12, Parameters=110M
– BERTlarge: L=24, H=1024, A=16, Parameters=340M
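The WordPiece splitting can be reproduced with the Hugging Face transformers tokenizer (an external library, not part of the slides); the exact split depends on the vocabulary, but it shows the ## continuation marker on sub-word pieces.

```python
from transformers import BertTokenizer

# Pre-trained BERT WordPiece vocabulary (~30k tokens)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("penguins are flightless birds"))
# expected along the lines of: ['penguin', '##s', 'are', 'flight', '##less', 'birds']
```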
BERT (2/4)
• Pre-training tasks
– Masked language model (see the masking sketch at the end of this slide)
• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
• 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
• 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy
– Next sentence prediction
• Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
• Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
• Pre-training procedure
– BooksCorpus (800M words) & English Wikipedia (2,500M words)
– sentence A embedding + sentence B embedding (50% of the time B is the actual next sentence, 50% a random sentence), combined length <= 512 tokens
– batch size: 256 sequences * 512 tokens = 128,000 tokens/batch for 1M steps, ~40 epochs over the 3.3B-word corpus
– Adam, dropout (0.1 on all layers), 'gelu' activation (same as OpenAI GPT)
– training loss = mean MLM likelihood + mean next sentence prediction likelihood
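A hedged sketch (not the authors' code) of the 80%/10%/10% masking rule above, applied to a single token position that has already been chosen for prediction (the paper selects 15% of positions); the toy vocabulary is only for illustration.

```python
import random

def mask_selected_token(token, vocab, mask_token="[MASK]"):
    r = random.random()
    if r < 0.8:                    # 80%: replace with [MASK]
        return mask_token
    elif r < 0.9:                  # 10%: replace with a random word
        return random.choice(vocab)
    else:                          # 10%: keep the original word
        return token

toy_vocab = ["my", "dog", "is", "hairy", "apple", "store", "milk"]
print(mask_selected_token("hairy", toy_vocab))  # '[MASK]', a random word, or 'hairy'
```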
BERT (3/4)
• Fine-tuning procedure
– batch size: 16, 32
– learning rate (Adam): 5e-5, 3e-5, 2e-5
– epochs: 3, 4 (a small search grid; see the sketch at the end of this slide)
• BERT vs OpenAI GPT
– OpenAI GPT: left-to-right Transformer LM on a large text corpus
– GPT: trained on BooksCorpus (800M words) / BERT: trained on BooksCorpus (800M words) and Wikipedia (2,500M words)
– GPT: sentence separator ([SEP]) and classifier token ([CLS]) introduced only at fine-tuning time / BERT: learns [SEP], [CLS] and sentence A/B embeddings during pre-training
– GPT: 1M steps with a batch size of 32,000 words / BERT: 1M steps with a batch size of 128,000 words
– GPT: same learning rate of 5e-5 for all fine-tuning experiments / BERT: task-specific fine-tuning learning rate that performs best on the development set
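The fine-tuning hyperparameters above form a small search grid; the sketch below simply enumerates it. The run_finetuning call is a hypothetical placeholder for whatever training loop is used, not a real API.

```python
from itertools import product

batch_sizes    = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]  # Adam
num_epochs     = [3, 4]

for bs, lr, ep in product(batch_sizes, learning_rates, num_epochs):
    config = {"batch_size": bs, "learning_rate": lr, "epochs": ep}
    # run_finetuning(config)  # hypothetical helper; keep the model that scores best on the dev set
    print(config)
```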
BERT (4/4)
GLUE
General Language Understanding Evaluation benchmark
9 tasks, training set sizes from 2.5k to 400k examples
CoNLL-2003 NER
Named Entity Recognition
200k words annotated as Person, Organization,
Location, Miscellaneous, or Other
SWAG Dev
Situations With Adversarial Generations
113k sentence-pair completion examples
evaluating grounded commonsense inference
SQuAD
Stanford Question Answering Dataset
100k crowdsourced question/answer pairs
NLP's 2018 = Vision's 2015
