BERT – Why and How?
HIL SEMINAR 190927
Cho Won Ik
• A short overview on recent LMs
• BERT
• Architecture
• Featurization
• Training objectives
• Representation
• Pre-trained models
• Fine-tuning
• Explaining BERT
• BERT and after (+ References)
Contents
• From n-grams to fastText: (pre-trained) word vectors
• N-gram
• Non-compressed bunch of words
• Word2vec [Mikolov et al., 2013]
• Skip-gram & CBOW
• A large vocabulary projected into a smaller dense space, constrained by context prediction
• GloVe [Pennington et al., 2014]
• Co-occurrence matrix combined
• fastText [Bojanowski et al., 2016]
• Subword information
A short overview on recent LMs
• Contextualized word embedding
• ELMo [Peters et al., 2018]
A short overview on recent LMs
image from https://jalammar.github.io/illustrated-bert/
• And how Transformers came into play:
• All you need is attention! [Vaswani et al., 2017]
• OpenAI GPT [Radford et al., 2018] (image source)
A short overview on recent LMs
• A new powerful self-supervised pretrained LM!
• Jacob Devlin et al., "Bert: Pre-training of deep bidirectional
transformers for language understanding," arXiv preprint
arXiv:1810.04805, 2018.
• Original Google Research implementation (TensorFlow)
• https://github.com/google-research/bert
• Hugging Face implementation on Transformer-based
architectures (PyTorch) – Reference code
• in https://github.com/huggingface/pytorch-transformers
• Currently changed to https://github.com/huggingface/transformers
(updated 2019.09.26!)
• BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, DistilBERT
A short overview on recent LMs
• BERT has only an encoder
• No explicit decoder!
• BERT uses the encoder from Transformer
– What’s same & what’s different?
• Same
• The module & submodule structure (multi-head attention, scaled dot-product
attention, input-output framework, etc.)
• Different
• No unidirectional restriction in the output layer while training
• Why BERT is “B”ERT
• Two simple tasks (will be explained)
BERT - Architecture
• Setup
• Let’s pretrain via simple tasks and utilize the result in downstream tasks
BERT - Architecture
• Setup
• Input: concatenation of two segments (sequences of tokens)
  x_1, ..., x_N and y_1, ..., y_M
• [CLS] x_1, ..., x_N [SEP] y_1, ..., y_M [SEP] (see the sketch below)
• [CLS] is trained to represent the sequence-level value used for classification
• [SEP] marks the boundary between the two segments
• Each segment may contain one or more natural ‘sentences’
• N + M < T, where T is the parameter that controls the maximum
sequence length
BERT - Architecture
(BERT original paper)
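Since the accompanying code images are not reproduced here, below is a minimal sketch of assembling this two-segment input with the Hugging Face tokenizer (assuming the `transformers` package and the `bert-base-uncased` vocabulary; the example sentences are placeholders):

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens_a = tokenizer.tokenize("the man went to the store")   # segment x_1, ..., x_N
tokens_b = tokenizer.tokenize("he bought a gallon of milk")  # segment y_1, ..., y_M

# [CLS] x_1 ... x_N [SEP] y_1 ... y_M [SEP]
tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

# segment (token type) ids: 0 for segment A (incl. [CLS] and the first [SEP]), 1 for segment B
token_type_ids = torch.tensor([[0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)])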
• Setup
• Input representation
BERT - Architecture
• Code for BERT architecture
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
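The architecture code shown on these slides comes from modeling_bert.py; as a stand-in for the screenshots, here is a quick way to inspect the same module layout (a sketch assuming the `transformers` package; the attribute names follow the BertModel implementation):

from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
print(model.embeddings)        # word / position / token-type embeddings + LayerNorm + dropout
print(model.encoder.layer[0])  # one BertLayer: attention -> intermediate -> output
print(model.pooler)            # dense + tanh applied to the [CLS] position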
• The Hugging Face BERT tokenizer combines two steps:
• Basic tokenization (punctuation splitting, lower casing, etc.)
• WordPiece
• Tokenizes a piece of text into its word pieces
• A greedy longest-match-first algorithm to perform tokenization using
the given vocabulary
• e.g., input = “unaffable” / output = [“un”, “##aff”, “##able”]
• Args: text
• A single token or whitespace separated tokens which should have
already been passed through `BasicTokenizer`
• Returns:
• A list of wordpiece tokens.
BERT - Featurization
• Code for BERT tokenizer: punctuation split + WPM
BERT - Featurization
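As a stand-in for the tokenizer screenshot, a usage sketch of the two-stage pipeline (basic tokenization, then WordPiece); the exact word pieces depend on the loaded vocabulary, and the “unaffable” split mentioned above is only the docstring’s illustrative example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

tokens = tokenizer.tokenize("Don't be unaffable!")
print(tokens)                                   # punctuation split + lower casing, then greedy WordPiece
print(tokenizer.convert_tokens_to_ids(tokens))  # integer ids fed to the embedding layer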
• Code for basic tokenizer
BERT - Featurization
• Code for WPM (WordPiece; cf. BPE [Sennrich et al., 2015])
BERT - Featurization
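A minimal sketch of the greedy longest-match-first WordPiece step described above (illustrative only; the real WordpieceTokenizer also handles whitespace splitting, unknown-token ids, and a per-word character cap); the toy vocabulary reuses the docstring example:

def wordpiece_tokenize(word, vocab, unk_token='[UNK]', max_chars=100):
    if len(word) > max_chars:
        return [unk_token]
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        while start < end:                      # try the longest substring first
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece            # continuation marker
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:                   # no piece matched: unknown word
            return [unk_token]
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {'un', '##aff', '##able'}
print(wordpiece_tokenize('unaffable', vocab))   # ['un', '##aff', '##able']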
• Code for BERT embedding
BERT - Featurization
• Code for BERT embedding (cont’d)
BERT - Featurization
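As a stand-in for the embedding code screenshots, a condensed paraphrase of what the BERT embedding module computes: token, position, and segment embeddings are summed, then layer-normalized and dropped out (the hyperparameter defaults below are the BASE values, used only for illustration):

import torch
import torch.nn as nn

class BertEmbeddingsSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768,
                 max_position_embeddings=512, type_vocab_size=2, dropout=0.1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)
        self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long,
                                    device=input_ids.device).unsqueeze(0).expand_as(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        # the three embeddings are simply summed, then normalized
        embeddings = (self.word_embeddings(input_ids)
                      + self.position_embeddings(position_ids)
                      + self.token_type_embeddings(token_type_ids))
        return self.dropout(self.LayerNorm(embeddings))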
• BERT learns how to represent the context
• via heavy pretraining on two simple tasks
• Masked LM (MLM)
• Cloze task [Taylor, 1953] – filling in the blank
• Similar to what SpecAugment [Park et al., 2019] does?
• Next sentence prediction (NSP)
• Checks whether the second segment actually follows the first
BERT – Training objectives
• Masked language model (MLM)
• A random sample of the tokens in the input sequence is
selected and replaced with the special token [MASK]
• MLM objective: cross-entropy loss on predicting the masked
tokens
• BERT uniformly selects 15% of input tokens for possible replacement
• Of the selected tokens, 80% are replaced with [MASK]
• 10% are left unchanged
• 10% are replaced by a randomly selected vocabulary token
  (see the masking sketch below)
BERT – Training objectives
(code excerpts from Google BERT)
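A minimal sketch of the 15% selection with the 80/10/10 replacement rule (illustrative only; the Google create_pretraining_data code additionally handles a cap on predictions per sequence, whole-word masking, and so on):

import random

def mask_tokens(tokens, vocab, mask_token='[MASK]',
                select_prob=0.15, special=('[CLS]', '[SEP]')):
    masked = list(tokens)
    labels = [None] * len(tokens)             # None = not predicted
    for i, tok in enumerate(tokens):
        if tok in special or random.random() >= select_prob:
            continue
        labels[i] = tok                       # predict the original token at this position
        r = random.random()
        if r < 0.8:
            masked[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10%: random vocabulary token
        # remaining 10%: keep the original token unchanged
    return masked, labels

vocab = ['the', 'man', 'went', 'to', 'store', 'dog', 'cute']
print(mask_tokens(['[CLS]', 'the', 'man', 'went', 'to', 'the', 'store', '[SEP]'], vocab))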
• Masked language model (MLM)
• masked_lm_labels: (`optional`) ``torch.LongTensor`` of shape
``(batch_size, sequence_length)``:
• Labels for computing the masked language modeling loss
• Indices should be in ``[-1, 0, ..., config.vocab_size]``
• Tokens with indices set to ``-1`` are ignored (masked), the loss is only
computed for the tokens with labels in ``[0, ..., config.vocab_size]``
• Example:
BERT – Training objectives
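The example referred to above is presumably the docstring’s usage snippet; a sketch assuming the 2019-era Hugging Face API, where the labels argument is still named masked_lm_labels:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # (1, seq_len)
outputs = model(input_ids, masked_lm_labels=input_ids)
loss, prediction_scores = outputs[:2]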
• Masked language model (MLM)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs:
• loss: (`optional`, returned when ``masked_lm_labels`` is provided)
``torch.FloatTensor`` of shape ``(1,)``: MLM loss
• prediction_scores: ``torch.FloatTensor`` of shape ``(batch_size,
sequence_length, config.vocab_size)``: Prediction scores of the LM
head (scores for each vocabulary token before SoftMax)
• hidden_states: (`optional`, returned when
``config.output_hidden_states=True``): list of ``torch.FloatTensor``
(one for the output of each layer + the output of the embeddings) of
shape ``(batch_size, sequence_length, hidden_size)``: Hidden-states
of the model at the output of each layer plus the initial embedding
outputs
BERT – Training objectives
• Masked language model (MLM)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs (cont’d):
• attentions: (`optional`, returned when ``config.output_attentions =
True``) list of ``torch.FloatTensor`` (one for each layer) of shape
``(batch_size, num_heads, sequence_length, sequence_length)``:
Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads
BERT – Training objectives
• Masked language model (MLM)
BERT – Training objectives
• Masked language model (MLM) (cont’d)
BERT – Training objectives
• Next sentence prediction (NSP)
• next_sentence_label: (`optional`) ``torch.LongTensor`` of
shape ``(batch_size,)``:
• Labels for computing the next sequence prediction (classification)
loss. Input should be a sequence pair (see ``input_ids`` docstring)
• Indices should be in ``[0, 1]``.
• ``0`` indicates sequence B is a continuation of sequence A
• ``1`` indicates sequence B is a random sequence.
• Example:
BERT – Training objectives
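Again, the example is presumably the docstring’s usage snippet; a sketch assuming the 2019-era Hugging Face API:

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
outputs = model(input_ids)
seq_relationship_scores = outputs[0]   # (1, 2): is-next vs. random-next logits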
• Next sentence prediction (NSP)
• Outputs: `Tuple` comprising various elements depending on the
configuration (config) and inputs:
• loss: (`optional`, returned when ``next_sentence_label`` is provided)
``torch.FloatTensor`` of shape ``(1,)``: NSP (classification) loss
• seq_relationship_scores: ``torch.FloatTensor`` of shape ``(batch_size, 2)``:
Prediction scores of the NSP head (scores of True/False continuation before
SoftMax).
• hidden_states: (`optional`, returned when
``config.output_hidden_states=True``): list of ``torch.FloatTensor``
(one for the output of each layer + the output of the embeddings) of
shape ``(batch_size, sequence_length, hidden_size)``: Hidden-states of
the model at the output of each layer plus the initial embedding outputs.
• attentions: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size,
num_heads, sequence_length, sequence_length)``: Attentions weights
after the attention softmax, used to compute the weighted average in
the self-attention heads.
BERT – Training objectives
• Next sentence prediction (NSP)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs:
• attentions: (`optional`, returned when
``config.output_attentions=True``) list of ``torch.FloatTensor`` (one for
each layer) of shape ``(batch_size, num_heads, sequence_length,
sequence_length)``: Attentions weights after the attention softmax,
used to compute the weighted average in the self-attention heads.
BERT – Training objectives
• Next sentence prediction (NSP)
BERT – Training objectives
• Code for BERT output layers
• SelfAttention
• SelfOutput
• Attention
• Intermediate
• Output
• Encoder
BERT - Representation
• Code for BERT output layers
BERT - Representation
image from https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
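As a stand-in for the SelfAttention screenshots, a condensed paraphrase of the scaled dot-product, multi-head self-attention computed inside each encoder layer (attention masking, dropout, and the output projection in SelfOutput are omitted for brevity):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionSketch(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_size = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def split_heads(self, x):
        b, t, _ = x.size()
        return x.view(b, t, self.num_heads, self.head_size).permute(0, 2, 1, 3)

    def forward(self, hidden_states):
        q = self.split_heads(self.query(hidden_states))
        k = self.split_heads(self.key(hidden_states))
        v = self.split_heads(self.value(hidden_states))
        scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.head_size)
        probs = F.softmax(scores, dim=-1)      # the "attentions" the model can return
        context = torch.matmul(probs, v)       # weighted average of the value vectors
        b, h, t, d = context.size()
        return context.permute(0, 2, 1, 3).reshape(b, t, h * d)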
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• BERT pretraining
BERT - Representation
• BERT pretraining (cont’d)
BERT - Representation
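A usage sketch of the combined pretraining heads through BertForPreTraining (assuming the `transformers` package), showing that one forward pass yields both the MLM and the NSP scores:

import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForPreTraining.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
prediction_scores, seq_relationship_scores = model(input_ids)[:2]
# prediction_scores:       (1, seq_len, vocab_size)  -> MLM head
# seq_relationship_scores: (1, 2)                    -> NSP head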
• What does BERT encode?
• A mixture of syntactic, semantic and possibly pre-phonetic
information?
• In what form is information encoded?
BERT - Representation
• BERT Base and Large
• L: the number of layers
• H: the hidden size
• A: the number of self-attention heads
• Training corpus: BOOKCORPUS + WIKIPEDIA (TOTAL 16GB)
• BERT_BASE:  L = 12, H = 768,  A = 12, #params = 110M
• BERT_LARGE: L = 24, H = 1024, A = 16, #params = 340M
• Pretrained models
• wrapper by Hugging Face
BERT – Pretrained models
• Code for importing pretrained models
BERT – Pretrained models
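A sketch of loading a pretrained model and tokenizer through the Hugging Face wrapper and checking the BASE configuration against the numbers above (assuming the `transformers` package):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

cfg = model.config
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)  # 12 768 12
print(sum(p.numel() for p in model.parameters()))                       # roughly 110M for BERT-Base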
• Code for importing pretrained models
BERT – Pretrained models
• How is the encoder representation utilized in
downstream tasks?
• The list of the tasks
• Sentence pair classification
• Single sentence classification
• Question answering
• Single sentence tagging
BERT - Fine-tuning
• Fine-tuning example
BERT - Fine-tuning
• Fine-tuning example
BERT - Fine-tuning
• Fine-tuning example
• e.g., run_glue.py: fine-tuning on GLUE tasks [Wang et al.,
2019] for sequence classification (shown here on the MRPC task; see the sketch below)
BERT - Fine-tuning
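A heavily stripped-down sketch in the spirit of run_glue.py on MRPC-style sentence-pair classification (single example, plain Adam instead of AdamW with warmup, no batching or evaluation); the sentences and label are placeholders:

import torch
from torch.optim import Adam
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = Adam(model.parameters(), lr=2e-5)

sent_a, sent_b = "The company said profits rose.", "Profits increased, the company said."
tokens_a, tokens_b = tokenizer.tokenize(sent_a), tokenizer.tokenize(sent_b)
tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
token_type_ids = torch.tensor([[0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)])
labels = torch.tensor([1])   # 1 = paraphrase in MRPC

model.train()
loss = model(input_ids, token_type_ids=token_type_ids, labels=labels)[0]
loss.backward()
optimizer.step()
optimizer.zero_grad()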
• Customizing with own data
BERT - Fine-tuning
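When customizing with your own data, most of the work is converting raw text and labels into the padded id tensors the model expects; a sketch that pairs with the fine-tuning loop above (the texts, labels, and maximum length are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_len = 64

texts = ["great movie", "terrible acting"]   # replace with your own data
labels = [1, 0]

def encode(text):
    tokens = ['[CLS]'] + tokenizer.tokenize(text)[:max_len - 2] + ['[SEP]']
    ids = tokenizer.convert_tokens_to_ids(tokens)
    mask = [1] * len(ids)
    pad = [0] * (max_len - len(ids))
    return ids + pad, mask + pad

input_ids, attention_mask = zip(*(encode(t) for t in texts))
dataset = TensorDataset(torch.tensor(input_ids),
                        torch.tensor(attention_mask),
                        torch.tensor(labels))
loader = DataLoader(dataset, batch_size=32, shuffle=True)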
• Syntactic and semantic pipelines
• I. Tenney et al., “BERT Rediscovers the Classical NLP Pipeline,” in Proc. ACL, 2019.
Explaining BERT
• Multilingual perspective
Explaining BERT
in Korean?
(pretrained models from Google BERT)
• Visualization
• J. Vig, “A Multiscale Visualization of Attention in the
Transformer Model,” in Proc. ACL (demo), 2019.
Explaining BERT
• Limitations
• XLNet
• RoBERTa
• DistilBERT
BERT and after
image from https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8
• T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. NIPS, 2013.
• J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proc. EMNLP, 2014.
• P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” arXiv preprint arXiv:1607.04606, 2016.
• M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word
representations,” in Proc. NAACL, 2018.
• A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all
you need,” in Proc. NIPS, 2017.
• A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding with unsupervised
learning,” Technical report, OpenAI, 2018.
• J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL, 2019.
• W. L. Taylor, “Cloze procedure: A new tool for measuring readability,” Journalism Bulletin, vol. 30, no. 4, pp. 415–433, 1953.
• R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
References (order of appearance)
• D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. Le, “SpecAugment: A Simple Data
Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech, 2019.
• A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-Task benchmark and analysis
platform for natural language understanding," in Proc. ICLR, 2019.
• I. Tenney, D. Das, and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” in Proc. ACL, 2019.
• J. Vig, “A multiscale visualization of attention in the Transformer model,” in Proc. ACL (demo), 2019.
• Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. Le, “XLNet: Generalized autoregressive pretraining
for language understanding,” arXiv preprint arXiv:1906.08237, 2019.
• Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A
robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
• DistilBERT (article): https://medium.com/huggingface/distilbert-8cf3380435b5
• Reference code (partially modified for visualization):
https://github.com/huggingface/transformers/blob/master/transformers/modeling_bert.py
https://github.com/huggingface/transformers/blob/master/transformers/tokenization_bert.py
References (order of appearance)
Thank You!!!
