The Illustrated BERT, ELMo, and co.
http://jalammar.github.io/illustrated-bert/
전상현
What to take away today
NLP’s Imagenet Moment
Transfer Learning
BERT
Transformer?
Imagenet Moment
Transfer Learning
http://ruder.io/nlp-imagenet/
https://www.slideshare.net/iljakuzovkin/paper-overview-deep-residual-learning-for-image-recognition
Transfer Learning
• In practice, very few people train an entire Convolutional Network from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size. Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images with 1000 categories), and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest.
https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
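A minimal PyTorch sketch of the two options described in the quote above, loosely following the linked tutorial; the ResNet-18 backbone and the 10-class target task are illustrative assumptions, not part of the slides.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pretrained on ImageNet (1.2 million images, 1000 categories).
model = models.resnet18(pretrained=True)

# Option A: fixed feature extractor -- freeze all pretrained weights.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-way ImageNet classifier with a new head for the target task
# (a hypothetical 10-class problem here); only this layer is trained.
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 10)

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)

# Option B: initialization -- skip the freezing loop above and instead
# fine-tune every parameter with a small learning rate.
```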
• Pretrained ImageNet models have been used to achieve state-of-the-art results in tasks such as object detection, semantic segmentation, human pose estimation, and video recognition.
https://www.slideshare.net/iljakuzovkin/paper-overview-deep-residual-learning-for-image-recognition
NLP’s Imagenet Moment
http://ruder.io/nlp-imagenet/
From Shallow to Deep Pre-Training
• At the core of the recent advances of ULMFiT, ELMo, and the OpenAI transformer is one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations.

• It is very likely that in a year’s time NLP practitioners will download pretrained language models rather than pretrained word embeddings.
http://ruder.io/nlp-imagenet/
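To make the older "first layer only" setup concrete, here is a minimal sketch; the toy vocabulary, the 300-dimensional vectors, and the LSTM encoder are assumptions for illustration, not anything from the post. Pretrained word vectors are copied into the embedding layer, while everything above it still starts from random initialization.

```python
import numpy as np
import torch
import torch.nn as nn

vocab = ["the", "bank", "river", "money"]   # toy vocabulary
dim = 300                                   # typical word2vec/GloVe size

# Pretend these rows were read from a pretrained word-vector file.
pretrained_vectors = np.random.randn(len(vocab), dim).astype("float32")

embedding = nn.Embedding(len(vocab), dim)
embedding.weight.data.copy_(torch.from_numpy(pretrained_vectors))

# Only this first layer carries pretrained knowledge; the encoder on top is
# still trained from scratch -- the situation the paradigm shift moves away from.
encoder = nn.LSTM(dim, 256, batch_first=True)
```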
Imagenet Moment
Download a pretrained model, then train it further for your own purpose
https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148
NLP’s Imagenet Moment
Download a pretrained model, then train it further for your own purpose
https://github.com/google-research/bert
https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148
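As a rough illustration of the download-then-fine-tune workflow. This sketch uses the Hugging Face transformers library rather than the google-research/bert scripts linked above, so the class names and the toy sentiment example are assumptions, not that repository's API.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Download the published checkpoint; every downstream task starts from here.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# One fine-tuning step on a toy labeled example.
inputs = tokenizer("this movie was great", return_tensors="pt")
labels = torch.tensor([1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
```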
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Bidirectional Encoder Representations from Transformers
BERT / GPT / ELMo
Input Representation
• WordPiece embeddings

• Positional embeddings with supported sequence lengths up to 512 tokens

• Segment embeddings marking sentence A vs. sentence B

WordPieceModel
https://github.com/google/sentencepiece
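A sketch of how the pieces above combine, assuming the BERT-Base sizes (30,522 WordPiece vocabulary, 768-dimensional hidden states, 512 positions): the model input is the element-wise sum of token, segment, and position embeddings. The code is illustrative, not taken from the repository.

```python
import torch
import torch.nn as nn

vocab_size, hidden, max_len = 30522, 768, 512    # BERT-Base vocabulary/width/length

token_emb = nn.Embedding(vocab_size, hidden)      # WordPiece token ids
segment_emb = nn.Embedding(2, hidden)             # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)      # learned positions, up to 512

def bert_input(token_ids, segment_ids):
    # token_ids, segment_ids: (batch, seq_len) integer tensors
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

x = bert_input(torch.tensor([[101, 7592, 102]]),  # [CLS] hello [SEP]
               torch.tensor([[0, 0, 0]]))
print(x.shape)  # torch.Size([1, 3, 768])
```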
we do not use traditional left-to-right or right-to-left language models to pre-train BERT.
Task #1: Masked LM
Task #2: Next Sentence Prediction
Unsupervised Learning
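A toy sketch of the Task #1 masking procedure, following the ratios reported in the paper (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% left unchanged); the tiny vocabulary and sentence are placeholders.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "man", "went", "to", "store", "river"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Return (masked tokens, labels); label is None where nothing is predicted."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                        # model must predict the original
            r = random.random()
            if r < 0.8:
                masked.append(MASK)                   # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(VOCAB))   # 10%: random token
            else:
                masked.append(tok)                    # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

print(mask_tokens("the man went to the store".split()))
```

Task #2 then adds one binary label per sentence pair saying whether sentence B really follows sentence A in the corpus, so both objectives come from unlabeled text.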
Training of BERT-Base was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERT-Large was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training took 4 days to complete.
GLUE
SQuAD 1.1
Differences between RNNs and Transformers
1. An RNN has to compute step by step in sequence, so it is hard to parallelize
2. With a Transformer (attention), the computation can be parallelized
3. Because it parallelizes, it is faster, and deeper and wider networks become feasible to train
http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
http://jalammar.github.io/illustrated-transformer/
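A small sketch of that contrast (the shapes and the GRU cell are arbitrary choices): the recurrent loop has to visit positions one at a time, while scaled dot-product attention relates every pair of positions in one batched matrix multiplication.

```python
import math
import torch
import torch.nn as nn

x = torch.randn(8, 100, 64)          # (batch, seq_len, features)

# RNN: the hidden state at step t depends on step t-1, so the time loop
# cannot be parallelized across positions.
cell, h = nn.GRUCell(64, 64), torch.zeros(8, 64)
for t in range(x.size(1)):
    h = cell(x[:, t], h)

# Self-attention: one matrix multiplication relates all positions at once,
# which is what makes deeper and wider networks practical to train.
q = k = v = x
scores = q @ k.transpose(1, 2) / math.sqrt(q.size(-1))   # (8, 100, 100)
out = torch.softmax(scores, dim=-1) @ v                   # (8, 100, 64)
```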
Exploring Randomly Wired Neural Networks for Image Recognition
In the end, what matters is not exactly how each individual connection is wired, but building a structure in which a large (wide and deep) network can be trained well, and then training it well on plenty of data.
Now, on to the main topic…
http://jalammar.github.io/illustrated-bert/
To look at later (as a follow-up?):
http://jalammar.github.io/illustrated-transformer/
http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
