BERT
Bidirectional Encoder Representations from Transformers
Jeangoo Yoon
Hanwha Systems / ICT
Pre-training in NLP
The biggest challenge in NLP is the shortage of training data:
most tasks have only a few hundred to a few thousand human-labeled training examples.
Characteristic of deep-learning-based NLP models:
performance improves when they are trained on
millions to billions of annotated training samples.
Approach: use the large amount of unannotated text on the web
for general-purpose pre-training.
Pre-trained representation model:
on small-data NLP tasks such as question answering and sentiment analysis,
it yields a larger accuracy gain than a model
trained from scratch.
ELMo* (feature-based)
OpenAI GPT* (fine-tuning)
* ELMo: Embeddings from Language Models; GPT: Generative Pre-trained Transformer
What makes BERT different?
Pre-trained representations
– context-free: word2vec, GloVe
– contextual, unidirectional: OpenAI GPT
– contextual, bidirectional: ELMo (shallow), BERT (deep)
BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus
Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo
uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three,
only BERT representations are jointly conditioned on both left and right context in all layers.
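To make the unidirectional vs. bidirectional distinction concrete, here is a minimal sketch (not from the slides; NumPy and a toy 4-token sequence are assumed) contrasting the left-to-right attention mask of a GPT-style Transformer with the full attention BERT applies in every layer.

```python
import numpy as np

seq_len = 4

# GPT-style left-to-right (causal) mask: position i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len)))

# BERT-style bidirectional attention: every position attends to every other position.
full_mask = np.ones((seq_len, seq_len))

print(causal_mask)  # lower-triangular: left context only
print(full_mask)    # all ones: left and right context in every layer
```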
NLP in 2018
• Word embedding
– word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fastText (Bojanowski et al., 2016)
• Language model
– ULMFiT(Howard et al., 18 Jan 2018) - Universal language model fine-tuning for text classification
– ELMo (Peters et al., 15 Feb 2018) - Deep contextualized word representations
• bidirectional LSTMs, language-model based
• Attention model
– Transformer(Vaswani et al., 12 Jun 2017) - Attention is all you need
• attention + feed-forward network, replaces RNNs
• Attention: in the mapping from seq A (a1, a2, a3) to seq B (b1, b2, b3), the decoder does not look b1 up in encoder A directly but forms it
as a weighted sum, i.e. b1 = w1*a1 + w2*a2 + w3*a3 (see the sketch after this list)
– OpenAI GPT(Radford et al., 11 Jun 2018) - Improving language understanding by generative pre-training
• unidirectional, attention, transformer
– BERT(Devlin et al., 11 Oct 2018) - BERT: Pre-training of deep bidirectional transformers for language
understanding
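As a small illustration of the weighted-sum view of attention in the bullet above, the following NumPy sketch computes b1 as a softmax-weighted combination of the encoder states a1..a3; the function name and dimensions are illustrative assumptions, not from the slides.

```python
import numpy as np

def attention(query, keys, values):
    # similarity of the decoder query to each encoder position a1..a3
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> w1, w2, w3
    return weights @ values                          # b1 = w1*a1 + w2*a2 + w3*a3

a = np.random.randn(3, 8)   # encoder sequence A = (a1, a2, a3), hidden dim 8
q = np.random.randn(8)      # decoder query for output step b1
b1 = attention(q, keys=a, values=a)
print(b1.shape)             # (8,)
```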
BERT (1/4)
• Input Representation
– WordPiece embeddings with a 30,000-token vocabulary; sub-word pieces are marked with ## (see the tokenizer sketch below)
– positional embeddings for sequences of up to 512 tokens
• Pre-training time
– BERTbase: 4 Cloud TPUs (16 TPU chips), 4 days
– BERTlarge: 16 Cloud TPUs (64 TPU chips), 4 days
• Model Architecture (L: number of layers (Transformer blocks), H: hidden size, A: number of self-attention heads)
– BERTbase: L=12, H=768, A=12, Parameters=110M
– BERTlarge: L=24, H=1024, A=16, Parameters=340M
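The WordPiece splitting can be reproduced with the Hugging Face transformers tokenizer (an external library, not part of the slides); the exact split depends on the vocabulary, but it shows the ## continuation marker on sub-word pieces.

```python
from transformers import BertTokenizer

# Pre-trained BERT WordPiece vocabulary (~30k tokens)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("penguins are flightless birds"))
# expected along the lines of: ['penguin', '##s', 'are', 'flight', '##less', 'birds']
```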
BERT (2/4)
• Pre-training tasks
– Masked language model (see the masking sketch at the end of this slide)
• 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
• 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
• 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy
– Next sentence prediction
• Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
• Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
• Pre-training procedure
– BooksCorpus (800M words) & English Wikipedia (2,500M words)
– sentence A embedding + sentence B embedding (50% of the time B is the actual next sentence, 50% a random sentence), combined length <= 512 tokens
– batch size: 256 sequences * 512 tokens = 128,000 tokens/batch for 1M steps, ~40 epochs over the 3.3B-word corpus
– Adam, dropout (0.1 on all layers), 'gelu' activation (same as OpenAI GPT)
– training loss = mean MLM likelihood + mean next sentence prediction likelihood
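A hedged sketch (not the authors' code) of the 80%/10%/10% masking rule above, applied to a single token position that has already been chosen for prediction (the paper selects 15% of positions); the toy vocabulary is only for illustration.

```python
import random

def mask_selected_token(token, vocab, mask_token="[MASK]"):
    r = random.random()
    if r < 0.8:                    # 80%: replace with [MASK]
        return mask_token
    elif r < 0.9:                  # 10%: replace with a random word
        return random.choice(vocab)
    else:                          # 10%: keep the original word
        return token

toy_vocab = ["my", "dog", "is", "hairy", "apple", "store", "milk"]
print(mask_selected_token("hairy", toy_vocab))  # '[MASK]', a random word, or 'hairy'
```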
BERT (3/4)
• Fine-tuning procedure
– batch size: 16, 32
– learning rate (Adam): 5e-5, 3e-5, 2e-5
– epochs: 3, 4 (a small search grid; see the sketch at the end of this slide)
• BERT vs OpenAI GPT
– OpenAI GPT: left-to-right Transformer LM on a large text corpus
– GPT: trained on BooksCorpus (800M words) / BERT: trained on BooksCorpus (800M words) and Wikipedia (2,500M words)
– GPT: sentence separator ([SEP]) and classifier token ([CLS]) introduced only at fine-tuning time / BERT: learns [SEP], [CLS] and sentence A/B embeddings during pre-training
– GPT: 1M steps with a batch size of 32,000 words / BERT: 1M steps with a batch size of 128,000 words
– GPT: same learning rate of 5e-5 for all fine-tuning experiments / BERT: task-specific fine-tuning learning rate that performs best on the development set
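The fine-tuning hyperparameters above form a small search grid; the sketch below simply enumerates it. The run_finetuning call is a hypothetical placeholder for whatever training loop is used, not a real API.

```python
from itertools import product

batch_sizes    = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]  # Adam
num_epochs     = [3, 4]

for bs, lr, ep in product(batch_sizes, learning_rates, num_epochs):
    config = {"batch_size": bs, "learning_rate": lr, "epochs": ep}
    # run_finetuning(config)  # hypothetical helper; keep the model that scores best on the dev set
    print(config)
```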
BERT (4/4)
GLUE
General Language Understanding Evaluation benchmark
9 tasks, training set sizes from 2.5k to 400k examples
CoNLL-2003 NER
Named Entity Recognition
200k words annotated as Person, Organization,
Location, Miscellaneous, or Other
SWAG Dev
Situations With Adversarial Generations
113k sentence-pair completion examples
evaluating grounded commonsense inference
SQuAD
Stanford Question Answering Dataset
100k crowdsourced question/answer pairs
NLP's 2018 = Vision's 2015
