BERT – Why and How?
HIL SEMINAR 190927
Cho Won Ik
• A short overview on recent LMs
• BERT
• Architecture
• Featurization
• Training objectives
• Representation
• Pre-trained models
• Fine-tuning
• Explaining BERT
• BERT and after (+ References)
Contents
• From n-grams to fastText: (pre-trained) word vectors
• N-gram
• Non-compressed bunch of words
• Word2vec [Mikolov et al., 2013]
• Skip-gram & CBOW
• A large vocabulary projected into a smaller dense space, constrained by context prediction
• GloVe [Pennington et al., 2014]
• Co-occurrence matrix combined
• fastText [Bojanowski et al., 2016]
• Subword information
A short overview on recent LMs
• Contextualized word embedding
• ELMo [Peters et al., 2018]
A short overview on recent LMs
image from https://jalammar.github.io/illustrated-bert/
• And how Transformers came into play:
• All you need is attention! [Vaswani et al., 2017]
• OpenAI GPT [Radford et al., 2018] (image source)
A short overview on recent LMs
• A new powerful self-supervised pretrained LM!
• Jacob Devlin et al., "Bert: Pre-training of deep bidirectional
transformers for language understanding," arXiv preprint
arXiv:1810.04805, 2018.
• Original Google Research implementation (TensorFlow)
• https://github.com/google-research/bert
• Hugging Face implementation on Transformer-based
architectures (PyTorch) – Reference code
• in https://github.com/huggingface/pytorch-transformers
• Currently changed to https://github.com/huggingface/transformers
(updated 2019.09.26!)
• BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, DistilBERT
A short overview on recent LMs
• BERT has only an encoder
• No explicit decoder!
• BERT uses the encoder from Transformer
– What’s same & what’s different?
• Same
• The module & submodule structure (multi-head attention, scaled dot-product
attention, input-output framework, etc.)
• Different
• No unidirectional restriction in the output layer while training
• Why BERT is “B”ERT
• Two simple tasks (will be explained)
BERT - Architecture
• Setup
• Let’s pretrain via simple tasks and utilize the result in downstream tasks
BERT - Architecture
• Setup
• Input: concatenation of two segments (sequences of tokens)
  x_1, ..., x_N and y_1, ..., y_M
• [CLS] x_1, ..., x_N [SEP] y_1, ..., y_M [SEP] (see the sketch below)
• [CLS] is trained to represent the sequence-level value used for classification
• [SEP] marks the boundary between the two segments
• Each segment may contain one or more natural ‘sentences’
• N + M < T, where T is the parameter that controls the maximum
sequence length
BERT - Architecture
(BERT original paper)
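Since the accompanying code images are not reproduced here, below is a minimal sketch of assembling this two-segment input with the Hugging Face tokenizer (assuming the `transformers` package and the `bert-base-uncased` vocabulary; the example sentences are placeholders):

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens_a = tokenizer.tokenize("the man went to the store")   # segment x_1, ..., x_N
tokens_b = tokenizer.tokenize("he bought a gallon of milk")  # segment y_1, ..., y_M

# [CLS] x_1 ... x_N [SEP] y_1 ... y_M [SEP]
tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

# segment (token type) ids: 0 for segment A (incl. [CLS] and the first [SEP]), 1 for segment B
token_type_ids = torch.tensor([[0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)])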
• Setup
• Input representation
BERT - Architecture
• Code for BERT architecture
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
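The architecture code shown on these slides comes from modeling_bert.py; as a stand-in for the screenshots, here is a quick way to inspect the same module layout (a sketch assuming the `transformers` package; the attribute names follow the BertModel implementation):

from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
print(model.embeddings)        # word / position / token-type embeddings + LayerNorm + dropout
print(model.encoder.layer[0])  # one BertLayer: attention -> intermediate -> output
print(model.pooler)            # dense + tanh applied to the [CLS] position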
• The Hugging Face BERT tokenizer combines two steps:
• Basic tokenization (punctuation splitting, lower casing, etc.)
• WordPiece
• Tokenizes a piece of text into its word pieces
• A greedy longest-match-first algorithm to perform tokenization using
the given vocabulary
• e.g., input = “unaffable” / output = [“un”, “##aff”, “##able”]
• Args: text
• A single token or whitespace separated tokens which should have
already been passed through `BasicTokenizer`
• Returns:
• A list of wordpiece tokens.
BERT - Featurization
• Code for BERT tokenizer: punctuation split + WPM
BERT - Featurization
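As a stand-in for the tokenizer screenshot, a usage sketch of the two-stage pipeline (basic tokenization, then WordPiece); the exact word pieces depend on the loaded vocabulary, and the “unaffable” split mentioned above is only the docstring’s illustrative example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

tokens = tokenizer.tokenize("Don't be unaffable!")
print(tokens)                                   # punctuation split + lower casing, then greedy WordPiece
print(tokenizer.convert_tokens_to_ids(tokens))  # integer ids fed to the embedding layer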
• Code for basic tokenizer
BERT - Featurization
• Code for WPM (WordPiece; cf. BPE [Sennrich et al., 2015])
BERT - Featurization
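A minimal sketch of the greedy longest-match-first WordPiece step described above (illustrative only; the real WordpieceTokenizer also handles whitespace splitting, unknown-token ids, and a per-word character cap); the toy vocabulary reuses the docstring example:

def wordpiece_tokenize(word, vocab, unk_token='[UNK]', max_chars=100):
    if len(word) > max_chars:
        return [unk_token]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        while start < end:                      # try the longest substring first
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece            # continuation marker
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:                   # no piece matched: unknown word
            return [unk_token]
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {'un', '##aff', '##able'}
print(wordpiece_tokenize('unaffable', vocab))   # ['un', '##aff', '##able']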
• Code for BERT embedding
BERT - Featurization
• Code for BERT embedding (cont’d)
BERT - Featurization
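As a stand-in for the embedding code screenshots, a condensed paraphrase of what the BERT embedding module computes: token, position, and segment embeddings are summed, then layer-normalized and dropped out (the hyperparameter defaults below are the BASE values, used only for illustration):

import torch
import torch.nn as nn

class BertEmbeddingsSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_position_embeddings=512, type_vocab_size=2, dropout=0.1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long,
                                    device=input_ids.device).unsqueeze(0).expand_as(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # the three embeddings are simply summed, then normalized
        embeddings = (self.word_embeddings(input_ids)
                      + self.position_embeddings(position_ids)
                      + self.token_type_embeddings(token_type_ids))
        return self.dropout(self.LayerNorm(embeddings))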
• BERT learns how to represent the context
• via heavy pretraining on two simple tasks
• Masked LM (MLM)
• Cloze task [Taylor, 1953] – filling in the blank
• Similar to what SpecAugment [Park et al., 2019] does?
• Next sentence prediction (NSP)
• Checks whether the second segment actually follows the first
BERT – Training objectives
• Masked language model (MLM)
• A random sample of the tokens in the input sequence is
selected and replaced with the special token [MASK]
• MLM objective: cross-entropy loss on predicting the masked
tokens
• BERT uniformly selects 15% of input tokens for possible replacement
• Of the selected tokens, 80% are replaced with [MASK]
• 10% are left unchanged
• 10% are replaced by a randomly selected vocabulary token
  (see the masking sketch below)
BERT – Training objectives
(code excerpts from Google BERT)
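A minimal sketch of the 15% selection with the 80/10/10 replacement rule (illustrative only; the Google create_pretraining_data code additionally handles a cap on predictions per sequence, whole-word masking, and so on):

import random

def mask_tokens(tokens, vocab, mask_token='[MASK]',
                select_prob=0.15, special=('[CLS]', '[SEP]')):
    masked = list(tokens)
    labels = [None] * len(tokens)             # None = not predicted
    for i, tok in enumerate(tokens):
        if tok in special or random.random() >= select_prob:
            continue
        labels[i] = tok                       # predict the original token at this position
        r = random.random()
        if r < 0.8:
            masked[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10%: random vocabulary token
        # remaining 10%: keep the original token unchanged
    return masked, labels

vocab = ['the', 'man', 'went', 'to', 'store', 'dog', 'cute']
print(mask_tokens(['[CLS]', 'the', 'man', 'went', 'to', 'the', 'store', '[SEP]'], vocab))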
• Masked language model (MLM)
• masked_lm_labels: (`optional`) ``torch.LongTensor`` of shape
``(batch_size, sequence_length)``:
• Labels for computing the masked language modeling loss
• Indices should be in ``[-1, 0, ..., config.vocab_size]``
• Tokens with indices set to ``-1`` are ignored (masked), the loss is only
computed for the tokens with labels in ``[0, ..., config.vocab_size]``
• Example:
BERT – Training objectives
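The example referred to above is presumably the docstring’s usage snippet; a sketch assuming the 2019-era Hugging Face API, where the labels argument is still named masked_lm_labels:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # (1, seq_len)
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]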
• Masked language model (MLM)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs:
• loss: (`optional`, returned when ``masked_lm_labels`` is provided)
``torch.FloatTensor`` of shape ``(1,)``: MLM loss
• prediction_scores: ``torch.FloatTensor`` of shape ``(batch_size,
sequence_length, config.vocab_size)``: Prediction scores of the LM
head (scores for each vocabulary token before SoftMax)
• hidden_states: (`optional`, returned when
``config.output_hidden_states=True``): list of ``torch.FloatTensor``
(one for the output of each layer + the output of the embeddings) of
shape ``(batch_size, sequence_length, hidden_size)``: Hidden-states
of the model at the output of each layer plus the initial embedding
outputs
BERT – Training objectives
• Masked language model (MLM)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs (cont’d):
• attentions: (`optional`, returned when ``config.output_attentions =
True``) list of ``torch.FloatTensor`` (one for each layer) of shape
``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads
BERT – Training objectives
• Masked language model (MLM)
BERT – Training objectives
• Masked language model (MLM) (cont’d)
BERT – Training objectives
• Next sentence prediction (NSP)
• next_sentence_label: (`optional`) ``torch.LongTensor`` of
shape ``(batch_size,)``:
• Labels for computing the next sequence prediction (classification)
loss. Input should be a sequence pair (see ``input_ids`` docstring)
• Indices should be in ``[0, 1]``.
• ``0`` indicates sequence B is a continuation of sequence A
• ``1`` indicates sequence B is a random sequence.
• Example:
BERT – Training objectives
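Again, the example is presumably the docstring’s usage snippet; a sketch assuming the 2019-era Hugging Face API:

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = model(input_ids)
seq_relationship_scores = outputs[0]   # (1, 2): is-next vs. random-next logits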
• Next sentence prediction (NSP)
• Outputs: `Tuple` comprising various elements depending on the
configuration (config) and inputs:
• loss: (`optional`, returned when ``next_sentence_label`` is provided)
``torch.FloatTensor`` of shape ``(1,)``: NSP (classification) loss
• seq_relationship_scores: ``torch.FloatTensor`` of shape ``(batch_size, 2)``:
Prediction scores of the NSP head (scores of True/False continuation before
SoftMax).
• hidden_states: (`optional`, returned when
``config.output_hidden_states=True``): list of ``torch.FloatTensor``
(one for the output of each layer + the output of the embeddings) of
shape ``(batch_size, sequence_length, hidden_size)``: Hidden-states of
the model at the output of each layer plus the initial embedding outputs.
• attentions: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size,
num_heads, sequence_length, sequence_length)``: Attentions weights
after the attention softmax, used to compute the weighted average in
the self-attention heads.
BERT – Training objectives
• Next sentence prediction (NSP)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs:
• attentions: (`optional`, returned when
``config.output_attentions=True``) list of ``torch.FloatTensor`` (one for
each layer) of shape ``(batch_size, num_heads, sequence_length,
sequence_length)``: Attentions weights after the attention softmax,
used to compute the weighted average in the self-attention heads.
BERT – Training objectives
• Next sentence prediction (NSP)
BERT – Training objectives
• Code for BERT output layers
• SelfAttention
• SelfOutput
• Attention
• Intermediate
• Output
• Encoder
BERT - Representation
• Code for BERT output layers
BERT - Representation
image from https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
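As a stand-in for the SelfAttention screenshots, a condensed paraphrase of the scaled dot-product, multi-head self-attention computed inside each encoder layer (attention masking, dropout, and the output projection in SelfOutput are omitted for brevity):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionSketch(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_size = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def split_heads(self, x):
        b, t, _ = x.size()
        return x.view(b, t, self.num_heads, self.head_size).permute(0, 2, 1, 3)

    def forward(self, hidden_states):
        q = self.split_heads(self.query(hidden_states))
        k = self.split_heads(self.key(hidden_states))
        v = self.split_heads(self.value(hidden_states))
        scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.head_size)
        probs = F.softmax(scores, dim=-1)      # the "attentions" the model can return
        context = torch.matmul(probs, v)       # weighted average of the value vectors
        b, h, t, d = context.size()
        return context.permute(0, 2, 1, 3).reshape(b, t, h * d)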
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• BERT pretraining
BERT - Representation
• BERT pretraining (cont’d)
BERT - Representation
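A usage sketch of the combined pretraining heads through BertForPreTraining (assuming the `transformers` package), showing that one forward pass yields both the MLM and the NSP scores:

import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
prediction_scores, seq_relationship_scores = model(input_ids)[:2]
# prediction_scores:       (1, seq_len, vocab_size)  -> MLM head
# seq_relationship_scores: (1, 2)                    -> NSP head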
• What does BERT encode?
• A mixture of syntactic, semantic and possibly pre-phonetic
information?
• In what form is information encoded?
BERT - Representation
• BERT Base and Large
• L: the number of layers
• H: the hidden size
• A: the number of self-attention heads
• Training corpus: BOOKCORPUS + WIKIPEDIA (TOTAL 16GB)
• BERT_BASE:  L = 12, H = 768,  A = 12, #params = 110M
• BERT_LARGE: L = 24, H = 1024, A = 16, #params = 340M
• Pretrained models
• wrapper by Hugging Face
BERT – Pretrained models
• Code for importing pretrained models
BERT – Pretrained models
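A sketch of loading a pretrained model and tokenizer through the Hugging Face wrapper and checking the BASE configuration against the numbers above (assuming the `transformers` package):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

cfg = model.config
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)  # 12 768 12
print(sum(p.numel() for p in model.parameters()))                       # roughly 110M for BERT-Base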
• Code for importing pretrained models
BERT – Pretrained models
• How is the encoder representation utilized in
downstream tasks?
• The list of the tasks
• Sentence pair classification
• Single sentence classification
• Question answering
• Single sentence tagging
BERT - Fine-tuning
• Fine-tuning example
BERT - Fine-tuning
• Fine-tuning example
BERT - Fine-tuning
• Fine-tuning example
• e.g., run_glue.py: fine-tuning on GLUE tasks [Wang et al.,
2019] for sequence classification (shown here on the MRPC task; see the sketch below)
BERT - Fine-tuning
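A heavily stripped-down sketch in the spirit of run_glue.py on MRPC-style sentence-pair classification (single example, plain Adam instead of AdamW with warmup, no batching or evaluation); the sentences and label are placeholders:

import torch
from torch.optim import Adam
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = Adam(model.parameters(), lr=2e-5)

sent_a, sent_b = "The company said profits rose.", "Profits increased, the company said."
tokens_a, tokens_b = tokenizer.tokenize(sent_a), tokenizer.tokenize(sent_b)
tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
token_type_ids = torch.tensor([[0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)])
labels = torch.tensor([1])   # 1 = paraphrase in MRPC

model.train()
loss = model(input_ids, token_type_ids=token_type_ids, labels=labels)[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()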
• Customizing with own data
BERT - Fine-tuning
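When customizing with your own data, most of the work is converting raw text and labels into the padded id tensors the model expects; a sketch that pairs with the fine-tuning loop above (the texts, labels, and maximum length are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_len = 64

texts = ["great movie", "terrible acting"]   # replace with your own data
labels = [1, 0]

def encode(text):
    tokens = ['[CLS]'] + tokenizer.tokenize(text)[:max_len - 2] + ['[SEP]']
    ids = tokenizer.convert_tokens_to_ids(tokens)
    mask = [1] * len(ids)
    pad = [0] * (max_len - len(ids))
    return ids + pad, mask + pad

input_ids, attention_mask = zip(*(encode(t) for t in texts))
dataset = TensorDataset(torch.tensor(input_ids),
                        torch.tensor(attention_mask),
                        torch.tensor(labels))
loader = DataLoader(dataset, batch_size=32, shuffle=True)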
• Syntactic and semantic pipelines
• I. Tenney et al., “BERT Rediscovers the Classical NLP Pipeline,” in Proc. ACL, 2019.
Explaining BERT
• Multilingual perspective
Explaining BERT
in Korean?
(pretrained models from Google BERT)
• Visualization
• J. Vig, “A Multiscale Visualization of Attention in the
Transformer Model,” in Proc. ACL (demo), 2019.
Explaining BERT
• Limitations
• XLNet
• RoBERTa
• DistilBERT
BERT and after
image from https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8
• T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. NIPS, 2013.
• J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proc. EMNLP, 2014.
• P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” arXiv preprint arXiv:1607.04606, 2016.
• M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word
representations,” in Proc. NAACL, 2018.
• A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all
you need,” in Proc. NIPS, 2017.
• A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding with unsupervised
learning,” Technical report, OpenAI, 2018.
• J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL, 2019.
• W. L. Taylor, “Cloze procedure: A new tool for measuring readability,” Journalism Bulletin, vol. 30, no. 4, pp. 415–433, 1953.
• R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
References (order of appearance)
• D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. Le, “SpecAugment: A Simple Data
Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech, 2019.
• A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-Task benchmark and analysis
platform for natural language understanding," in Proc. ICLR, 2019.
• I. Tenney, D. Das, and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” in Proc. ACL, 2019.
• J. Vig, “A multiscale visualization of attention in the Transformer model,” in Proc. ACL (demo), 2019.
• Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. Le, “XLNet: Generalized autoregressive pretraining
for language understanding,” arXiv preprint arXiv:1906.08237, 2019.
• Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A
robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
• DistilBERT (article): https://medium.com/huggingface/distilbert-8cf3380435b5
• Reference code (partially modified for visualization):
https://github.com/huggingface/transformers/blob/master/transformers/modeling_bert.py
https://github.com/huggingface/transformers/blob/master/transformers/tokenization_bert.py
References (order of appearance)
Thank You!!!
