BERT – Why and How?
HIL SEMINAR 190927
Cho Won Ik
• A short overview on recent LMs
• BERT
• Architecture
• Featurization
• Training objectives
• Representation
• Pre-trained models
• Fine-tuning
• Explaining BERT
• BERT and after (+ References)
Contents
• From n-grams to fastText: (pre-trained) word vectors
• N-gram
• Non-compressed bunch of words
• Word2vec [Mikolov et al., 2013]
• Skip-gram & CBOW
• Large vocab set projected to a smaller space, with the constraints
• GloVe [Pennington et al., 2014]
• Co-occurrence matrix combined
• fastText [Bojanowski et al., 2016]
• Subword information
A short overview on recent LMs
• Contextualized word embedding
• ELMo [Peters et al., 2018]
A short overview on recent LMs
image from https://jalammar.github.io/illustrated-bert/
• And how transformers engaged in:
• All you need is attention! [Vaswani et al., 2017]
• OpenAI GPT [Radford et al., 2018] (image source)
A short overview on recent LMs
• A new powerful self-supervised pretrained LM!
• Jacob Devlin et al., "BERT: Pre-training of deep bidirectional
transformers for language understanding," arXiv preprint
arXiv:1810.04805, 2018.
• Original Google Research implementation (TensorFlow)
• https://github.com/google-research/bert
• Hugging Face implementation on Transformer-based
architectures (PyTorch) – Reference code
• in https://github.com/huggingface/pytorch-transformers
• Currently changed to https://github.com/huggingface/transformers
(updated 2019.09.26!)
• BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, DistilBERT
A short overview on recent LMs
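For orientation, here is a minimal usage sketch of the Hugging Face wrapper (assuming `pip install torch transformers`; the calls follow the current 4.x API, while the class names are the same as in the older pytorch-transformers package):

```python
# Minimal sketch: loading BERT through the Hugging Face `transformers` wrapper.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT is an encoder-only model.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```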
• BERT has only an encoder
• No explicit decoder!
• BERT uses the encoder from Transformer
– What’s same & what’s different?
• Same
• The module & submodule structure (multi-head attention, scaled
dot-product attention, in-out framework, etc.)
• Different
• No unidirectional (left-to-right) restriction while training: self-attention
sees both left and right context
• Hence the ``B’’ (Bidirectional) in BERT
• Two simple tasks (will be explained)
BERT - Architecture
• Setup
• Let’s pretrain on simple tasks so the representation can be reused in downstream tasks
BERT - Architecture
• Setup
• Input: concatenation of two segments (sequences of tokens),
x_1, ..., x_N and y_1, ..., y_M
• [CLS] x_1, ..., x_N [SEP] y_1, ..., y_M [SEP]
• [CLS] is trained to represent the value used for classification
• [SEP] marks the boundary between the two segments
• Each segment may consist of one or more natural ‘sentences’
• N + M < T, where T is the parameter that controls the maximum
sequence length
BERT - Architecture
(BERT original paper)
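A small sketch of how this two-segment input looks in practice, built with the Hugging Face tokenizer (the sentences are made-up examples; the tokenizer inserts [CLS]/[SEP] and the segment ids itself):

```python
# Sketch of the [CLS] x_1 ... x_N [SEP] y_1 ... y_M [SEP] input format.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

segment_a = "The cat sat on the mat."   # x_1, ..., x_N
segment_b = "It was purring happily."   # y_1, ..., y_M

encoding = tokenizer(segment_a, segment_b)  # pair input
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'the', 'cat', ..., '[SEP]', 'it', 'was', ..., '[SEP]']
print(encoding["token_type_ids"])  # 0s over segment A (incl. [CLS]), 1s over segment B
```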
• Setup
• Input representation
BERT - Architecture
• Code for BERT architecture
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
• Code for BERT architecture (cont’d)
BERT - Architecture
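The architecture slides above are screenshots of the reference code; as a stand-in, the same module structure can be inspected directly (a sketch assuming the `transformers` 4.x package):

```python
# Peek at the module structure the screenshots walk through:
# embeddings -> stacked encoder layers -> pooler over the [CLS] state.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.embeddings)        # word/position/token_type embeddings + LayerNorm + dropout
print(model.encoder.layer[0])  # one of 12 BertLayer blocks (attention -> intermediate -> output)
print(model.pooler)            # Linear + Tanh applied to the first ([CLS]) hidden state
```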
• The Hugging Face BERT tokenizer has two stages:
• Basic (punctuation splitting, lower casing, etc.)
• WordPiece
• Tokenizes a piece of text into its word pieces
• A greedy longest-match-first algorithm to perform tokenization using
the given vocabulary
• e.g., input = “unaffable” / output = [“un”, “##aff”, “##able”]
• Args: text
• A single token or whitespace separated tokens which should have
already been passed through `BasicTokenizer`
• Returns:
• A list of wordpiece tokens.
BERT - Featurization
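Since the tokenizer code appears only as screenshots in the following slides, here is an illustrative re-implementation of the greedy longest-match-first loop described above (toy vocabulary; the real WordpieceTokenizer additionally enforces a max-characters-per-word limit, falls back to [UNK], and runs after BasicTokenizer):

```python
# Illustrative greedy longest-match-first WordPiece tokenization (toy vocabulary).
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:                    # try the longest substring first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece          # continuation pieces get the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:                       # no piece matched: whole word is unknown
            return [unk_token]
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```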
• Code for BERT tokenizer: punctuation split + WPM
BERT - Featurization
• Code for basic tokenizer
BERT - Featurization
• Code for WPM [Sennrich et al., 2015]
BERT - Featurization
• Code for BERT embedding
BERT - Featurization
• Code for BERT embedding (cont’d)
BERT - Featurization
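The embedding slides are likewise screenshots; the gist of the embedding module (token + position + segment embeddings, summed, then LayerNorm and dropout) can be sketched roughly as follows, with BERT-Base sizes assumed:

```python
# Rough sketch of BERT's input embedding (sizes follow BERT-Base).
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, dropout=0.1):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden)
        self.position_embeddings = nn.Embedding(max_len, hidden)
        self.token_type_embeddings = nn.Embedding(2, hidden)
        self.layer_norm = nn.LayerNorm(hidden, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, token_type_ids):
        seq_len = input_ids.size(1)
        position_ids = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        x = (self.word_embeddings(input_ids)
             + self.position_embeddings(position_ids)
             + self.token_type_embeddings(token_type_ids))
        return self.dropout(self.layer_norm(x))

emb = BertStyleEmbeddings()
ids = torch.randint(0, 30522, (2, 10))
segments = torch.zeros(2, 10, dtype=torch.long)
print(emb(ids, segments).shape)  # torch.Size([2, 10, 768])
```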
• BERT learns how to represent the context
• via heavy (pre)training on two simple tasks
• Masked LM (MLM)
• Cloze task [Taylor, 1953] – filling in the blank
• Similar to what SpecAugment [Park et al., 2019] does?
• Next sentence prediction (NSP)
• Checks the relevance between the sentences
BERT – Training objectives
• Masked language model (MLM)
• A random sample of the tokens in the input sequence is
selected and replaced with the special token [MASK]
• MLM objective: cross-entropy loss on predicting the masked
tokens
• BERT uniformly selects 15% of input tokens for possible replacement
• Of the selected tokens, 80% are replaced with [MASK]
• 10% are left unchanged
• 10% are replaced by a randomly selected vocabulary token
BERT – Training objectives
(the code snippets here are from the Google BERT repository; a sketch of the masking recipe follows below)
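Here is an illustrative sketch of the 15% / 80-10-10 recipe above (not the Google data pipeline itself, which adds special-token and whole-word handling on top of this):

```python
# Minimal sketch of the MLM corruption recipe: select 15% of positions;
# of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged.
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)         # -100 = ignored by the loss (older code used -1)
    for i, tok in enumerate(token_ids):
        if random.random() >= mlm_prob:
            continue
        labels[i] = tok                       # predict the original token here
        r = random.random()
        if r < 0.8:
            corrupted[i] = mask_id            # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.randrange(vocab_size)  # 10%: random vocabulary token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

ids = [2023, 2003, 1037, 7279, 1012]          # toy token ids
print(mask_tokens(ids, mask_id=103, vocab_size=30522))
```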
• Masked language model (MLM)
• masked_lm_labels: (`optional`) ``torch.LongTensor`` of shape
``(batch_size, sequence_length)``:
• Labels for computing the masked language modeling loss
• Indices should be in ``[-1, 0, ..., config.vocab_size]``
• Tokens with indices set to ``-1`` are ignored (masked), the loss is only
computed for the tokens with labels in ``[0, ..., config.vocab_size]``
• Example:
BERT – Training objectives
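The example in the slide is a screenshot; a comparable usage sketch against the current `transformers` 4.x API looks like this (newer versions call the argument `labels` and use -100 as the ignore index, where the docstring above shows `masked_lm_labels` and -1):

```python
# Hedged MLM usage sketch (transformers 4.x API).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = inputs["input_ids"].clone()
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100  # score only the masked position

outputs = model(**inputs, labels=labels)
print(outputs.loss)                  # MLM cross-entropy loss
print(outputs.logits.shape)          # (1, sequence_length, vocab_size)

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = outputs.logits[0, mask_pos].argmax(-1).item()
print(tokenizer.convert_ids_to_tokens(predicted_id))  # most likely "paris"
```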
• Masked language model (MLM)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs:
• loss: (`optional`, returned when ``masked_lm_labels`` is provided)
``torch.FloatTensor`` of shape ``(1,)``: MLM loss
• prediction_scores: ``torch.FloatTensor`` of shape ``(batch_size,
sequence_length, config.vocab_size)``: Prediction scores of the LM
head (scores for each vocabulary token before SoftMax)
• hidden_states: (`optional`, returned when
``config.output_hidden_states=True``): list of ``torch.FloatTensor``
(one for the output of each layer + the output of the embeddings) of
shape ``(batch_size, sequence_length, hidden_size)``: Hidden-states
of the model at the output of each layer plus the initial embedding
outputs
BERT – Training objectives
• Masked language model (MLM)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs (cont’d):
• attentions: (`optional`, returned when ``config.output_attentions =
True``) list of ``torch.FloatTensor`` (one for each layer) of shape
``(batch_size, num_heads, sequence_length, sequence_length)``:
Attention weights after the attention softmax, used to compute the
weighted average in the self-attention heads
BERT – Training objectives
• Masked language model (MLM)
BERT – Training objectives
• Masked language model (MLM) (cont’d)
BERT – Training objectives
• Next sentence prediction (NSP)
• next_sentence_label: (`optional`) ``torch.LongTensor`` of
shape ``(batch_size,)``:
• Labels for computing the next sequence prediction (classification)
loss. Input should be a sequence pair (see ``input_ids`` docstring)
• Indices should be in ``[0, 1]``.
• ``0`` indicates sequence B is a continuation of sequence A
• ``1`` indicates sequence B is a random sequence.
• Example:
BERT – Training objectives
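Again the slide's example is a screenshot; a comparable sketch with the current `transformers` 4.x API (which accepts the label as `labels`; older versions used `next_sentence_label`):

```python
# Hedged NSP usage sketch (transformers 4.x API).
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sent_a = "The cat sat on the mat."
sent_b = "It started to purr."                 # a plausible continuation
encoding = tokenizer(sent_a, sent_b, return_tensors="pt")

outputs = model(**encoding, labels=torch.LongTensor([0]))  # 0 = B follows A
print(outputs.loss)    # NSP classification loss
print(outputs.logits)  # shape (1, 2): isNext vs. random scores before softmax
```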
• Next sentence prediction (NSP)
• Outputs: `Tuple` comprising various elements depending on the
configuration (config) and inputs:
• loss: (`optional`, returned when ``next_sentence_label`` is provided)
``torch.FloatTensor`` of shape ``(1,)``: NSP (classification) loss
• seq_relationship_scores: ``torch.FloatTensor`` of shape ``(batch_size,
2)``: Prediction scores of the NSP head (scores of
True/False continuation before SoftMax).
• hidden_states: (`optional`, returned when
``config.output_hidden_states=True``): list of ``torch.FloatTensor``
(one for the output of each layer + the output of the embeddings) of
shape ``(batch_size, sequence_length, hidden_size)``: Hidden-states of
the model at the output of each layer plus the initial embedding outputs.
• attentions: (`optional`, returned when ``config.output_attentions=True``)
list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size,
num_heads, sequence_length, sequence_length)``: Attention weights
after the attention softmax, used to compute the weighted average in
the self-attention heads.
BERT – Training objectives
• Next sentence prediction (NSP)
• Outputs: `Tuple` comprising various elements depending on
the configuration (config) and inputs:
• attentions: (`optional`, returned when
``config.output_attentions=True``) list of ``torch.FloatTensor`` (one for
each layer) of shape ``(batch_size, num_heads, sequence_length,
sequence_length)``: Attention weights after the attention softmax,
used to compute the weighted average in the self-attention heads.
BERT – Training objectives
• Next sentence prediction (NSP)
BERT – Training objectives
• Code for BERT output layers
• SelfAttention
• SelfOutput
• Attention
• Intermediate
• Output
• Encoder
BERT - Representation
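As a reminder of what the SelfAttention submodule computes, here is a compact sketch of multi-head scaled dot-product self-attention with BERT-Base sizes (illustrative, not the reference code; the real module also applies the attention mask and dropout):

```python
# Compact sketch of multi-head scaled dot-product self-attention.
import math
import torch
import torch.nn as nn

class SelfAttentionSketch(nn.Module):
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.heads, self.head_dim = heads, hidden // heads
        self.query = nn.Linear(hidden, hidden)
        self.key = nn.Linear(hidden, hidden)
        self.value = nn.Linear(hidden, hidden)

    def split(self, x):                       # (B, T, H) -> (B, heads, T, head_dim)
        b, t, _ = x.shape
        return x.view(b, t, self.heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        q, k, v = self.split(self.query(x)), self.split(self.key(x)), self.split(self.value(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)  # (B, heads, T, T)
        probs = scores.softmax(dim=-1)        # the "attentions" the docstrings return
        ctx = probs @ v                       # weighted average of the value vectors
        b, _, t, _ = ctx.shape
        return ctx.transpose(1, 2).reshape(b, t, -1), probs

x = torch.randn(2, 16, 768)
out, attn = SelfAttentionSketch()(x)
print(out.shape, attn.shape)  # torch.Size([2, 16, 768]) torch.Size([2, 12, 16, 16])
```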
• Code for BERT output layers
BERT - Representation
image from https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• Code for BERT output layers (cont’d)
BERT - Representation
• BERT pretraining
BERT - Representation
• BERT pretraining (cont’d)
BERT - Representation
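The pretraining slides are code screenshots; a hedged sketch of how the two objectives come together through BertForPreTraining in the 4.x API (toy labels):

```python
# Hedged sketch: the MLM and NSP losses are summed inside BertForPreTraining.
import torch
from transformers import BertForPreTraining, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

enc = tokenizer("The cat sat on the [MASK].", "It started to purr.", return_tensors="pt")
mlm_labels = torch.full_like(enc["input_ids"], -100)      # ignore every position...
mask_pos = (enc["input_ids"] == tokenizer.mask_token_id)
mlm_labels[mask_pos] = tokenizer.convert_tokens_to_ids("mat")  # ...except the masked one

outputs = model(**enc, labels=mlm_labels, next_sentence_label=torch.LongTensor([0]))
print(outputs.loss)                           # MLM loss + NSP loss
print(outputs.prediction_logits.shape)        # (1, sequence_length, vocab_size)
print(outputs.seq_relationship_logits.shape)  # (1, 2)
```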
• What does BERT encode?
• A mixture of syntactic, semantic and possibly pre-phonetic
information?
• In what form is information encoded?
BERT - Representation
• BERT Base and Large
• L: the number of layers
• H: the hidden size
• A: the number of self-attention heads
• Training corpus: BookCorpus + English Wikipedia (16GB of text in total)
• BERT-BASE: L = 12, H = 768, A = 12, 110M parameters
• BERT-LARGE: L = 24, H = 1024, A = 16, 340M parameters
• Pretrained models
• wrapper by Hugging Face
BERT – Pretrained models
• Code for importing pretrained models
BERT – Pretrained models
• Code for importing pretrained models
BERT – Pretrained models
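The two import slides are screenshots; an equivalent hedged sketch of pulling the pretrained weights and checking the L/H/A configuration:

```python
# Hedged sketch of loading pretrained weights/config through the Hugging Face wrapper.
from transformers import BertConfig, BertModel, BertForSequenceClassification

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 12 768 12

# Same checkpoint, different heads on top of the encoder:
encoder = BertModel.from_pretrained("bert-base-uncased")
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
print(sum(p.numel() for p in encoder.parameters()))  # about 110M parameters for BERT-Base
```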
• How is the encoder representation utilized in
downstream tasks?
• The list of the tasks
• Sentence pair classification
• Single sentence classification
• Question answering
• Single sentence tagging
BERT - Fine-tuning
• Fine-tuning example
BERT - Fine-tuning
• Fine-tuning example
BERT - Fine-tuning
• Fine-tuning example
• e.g., run_glue.py: fine-tuning on GLUE tasks [Wang et al.,
2019] for sequence classification (the example shown uses the MRPC task)
BERT - Fine-tuning
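A minimal fine-tuning sketch in the spirit of run_glue.py, with made-up MRPC-style sentence pairs (the real script adds DataLoader batching, warmup scheduling, and evaluation):

```python
# Minimal fine-tuning sketch (toy sentence-pair data; not the full run_glue.py pipeline).
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

pairs = [("He bought the book.", "The book was bought by him.", 1),   # paraphrase
         ("He bought the book.", "She sold her car.", 0)]             # not a paraphrase

model.train()
for sent_a, sent_b, label in pairs:
    batch = tokenizer(sent_a, sent_b, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**batch, labels=torch.LongTensor([label]))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(float(outputs.loss))
```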
• Customizing with your own data
BERT - Fine-tuning
• Syntactic and semantic pipelines
• I. Tenney et al., “BERT
Rediscovers the Classical NLP
Pipeline,” in Proc. ACL, 2019.
Explaining BERT
• Multilingual perspective
Explaining BERT
in Korean?
(pretrained models from Google BERT)
• Visualization
• J. Vig, “A Multiscale Visualization of Attention in the
Transformer Model,” in Proc. ACL (demo), 2019.
Explaining BERT
• Limitations
• XLNet
• RoBERTa
• DistilBERT
BERT and after
image from https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8
• T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc.
NIPS, 2013.
• J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proc.
EMNLP, 2014.
• P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” arXiv
preprint arXiv:1607.04606, 2016.
• M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word
representations,” in Proc. NAACL, 2018.
• A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all
you need,” in Proc. NIPS, 2017.
• A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding with unsupervised
learning,” Technical report, OpenAI, 2018.
• J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language
understanding,” in Proc. NAACL, 2019.
• W. L. Taylor, “Cloze procedure: A new tool for measuring readability,” Journalism Bulletin, vol. 30, no. 4, pp. 415–
433, 1953.
• R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword
units,” arXiv preprint arXiv:1508.07909, 2015.
References (order of appearance)
• D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. Le, “SpecAugment: A Simple Data
Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech, 2019.
• A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A multi-task benchmark and analysis
platform for natural language understanding," in Proc. ICLR, 2019.
• I. Tenney, D. Das, and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” in Proc. ACL, 2019.
• J. Vig, “A multiscale visualization of attention in the Transformer model,” in Proc. ACL (demo), 2019.
• Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. Le, “XLNet: Generalized autoregressive pretraining
for language understanding,” arXiv preprint arXiv:1906.08237, 2019.
• Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A
robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
• DistilBERT (article): https://medium.com/huggingface/distilbert-8cf3380435b5
• Reference code (partially modified for visualization):
https://github.com/huggingface/transformers/blob/master/transformers/modeling_bert.py
https://github.com/huggingface/transformers/blob/master/transformers/tokenization_bert.py
References (order of appearance)
Thank You!!!