This document introduces and surveys sequence-to-sequence (seq2seq) models, attention mechanisms, the Transformer architecture, and BERT for natural language processing. It discusses applications of seq2seq models such as language translation and text summarization. Key topics include the encoder-decoder architecture of seq2seq models, how attention improves seq2seq by letting the model focus on the relevant parts of the context, and the Transformer architecture, which uses self-attention rather than recurrent layers. BERT is introduced as a bidirectional Transformer encoder, pre-trained on large amounts of unlabeled text, that achieves state-of-the-art results across a range of NLP tasks. Code examples and homework suggestions are also provided.
2. About me
• Education
• NCU (MIS), NCCU (CS)
• Work Experience
• Telecom big data innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data / ML / AIoT / AI columnist
6. Seq2seq
• Seq2seq is a family of machine learning approaches used for natural language processing (NLP)
• Applications include
• language translation
• image captioning
• conversational models
• text summarization
7. Seq2seq
• Seq2seq turns one sequence into another sequence (sequence transformation)
• The primary components are one encoder and one decoder network
• It typically applies a recurrent neural network (RNN), more often an LSTM or GRU, to avoid the vanishing gradient problem; a minimal encoder sketch follows below
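The slides do not include code here; as a rough illustration only, a minimal GRU-based encoder might look like the following PyTorch sketch (class and parameter names are illustrative, not from the course material):

```python
import torch
import torch.nn as nn

class Seq2SeqEncoder(nn.Module):
    """Minimal seq2seq encoder: embed the input tokens, run a GRU,
    and return the final hidden state as the context vector."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) integer token ids
        embedded = self.embed(src_ids)        # (batch, src_len, emb_dim)
        outputs, hidden = self.gru(embedded)  # hidden: (1, batch, hidden_dim)
        return outputs, hidden                # hidden serves as the context
```

A GRU (or LSTM) is used instead of a vanilla RNN because its gating mitigates vanishing gradients over long sequences.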
8. Seq2seq
• The context for each item is the output from the previous step
• The encoder turns each input sequence into a corresponding hidden vector containing its context
• The decoder reverses the process, turning the vector into an output item, using the previous output as the input context (see the decoding sketch below)
Ref: https://en.wikipedia.org/wiki/Seq2seq
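To make the decoding loop concrete, here is a hedged PyTorch sketch of a matching decoder that feeds each predicted token back in as the next input (greedy decoding; all names are illustrative and assume the Seq2SeqEncoder sketch above):

```python
import torch
import torch.nn as nn

class Seq2SeqDecoder(nn.Module):
    """Minimal seq2seq decoder: the previous output token is fed back
    in as the next input, conditioned on the encoder's context."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):
        # prev_token: (batch, 1); hidden: (1, batch, hidden_dim)
        output, hidden = self.gru(self.embed(prev_token), hidden)
        return self.out(output.squeeze(1)), hidden  # logits over the vocab

def greedy_decode(decoder, context, sos_id, eos_id, max_len=20):
    """Start from <sos> and repeatedly feed the argmax token back in."""
    token = torch.full((context.size(1), 1), sos_id, dtype=torch.long)
    hidden, outputs = context, []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1, keepdim=True)
        outputs.append(token)
        if (token == eos_id).all():
            break
    return torch.cat(outputs, dim=1)
```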
11. Attention for Seq2seq
• Attention tremendously improves the seq2seq model
• With attention, the seq2seq model no longer forgets the input context
• With attention, the decoder knows which parts of the context to focus on
• The trade-off is a large amount of extra computation (see the sketch below)
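As an illustration of the mechanism (not code from the slides), a bare dot-product attention step can be sketched in PyTorch as follows; at each decoding step it scores every encoder position, which is exactly where the extra computation comes from:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_outputs):
    """decoder_state: (batch, hidden); encoder_outputs: (batch, src_len, hidden).
    Scores each encoder position against the decoder state, turns the
    scores into weights, and returns the weighted sum as the context."""
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2))  # (batch, src_len, 1)
    weights = F.softmax(scores.squeeze(2), dim=1)                    # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs)       # (batch, 1, hidden)
    return context.squeeze(1), weights
```

The weights tell the decoder which input positions to notice; computing them for every decoder step over every encoder position is what makes attention expensive.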
22. Attention is all you need
Ref: Vaswani et al., "Attention Is All You Need", https://arxiv.org/abs/1706.03762 (published 2017)
23. Transformer model
• The Transformer is a kind of seq2seq model (encoder/decoder)
• The Transformer is NOT an RNN
• The Transformer has only attention and dense (feed-forward) layers, with no recurrence
• It achieves very good performance in language translation; the core self-attention operation is sketched below
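A hedged sketch of the scaled dot-product self-attention at the heart of the Transformer, following Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V from the paper cited above (the projection matrices are plain tensors here for brevity):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k).
    Every position attends to every other position in one matrix
    product, so no recurrence is needed: this is the Transformer's
    key departure from RNN-based seq2seq."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    return F.softmax(scores, dim=-1) @ v
```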
37. BERT
• Bidirectional Encoder Representations from Transformers (BERT)
• BERT = the encoder of the Transformer
• It learns from a large amount of text without annotation
• For Chinese inputs, character-level tokens work better than a word vocabulary (see the tokenizer sketch below)
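As a quick, hedged demonstration of the character-level point using the Hugging Face transformers library (not part of the original slides):

```python
from transformers import BertTokenizer

# bert-base-chinese splits Chinese text into individual characters,
# matching the slide's point that characters beat a word vocabulary here.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
print(tokenizer.tokenize("深度學習"))  # ['深', '度', '學', '習']
```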
38. BERT
Ref: Google's BERT - NLP and Transformer Architecture That Are Reshaping AI Landscape (neptune.ai)
39. Transformer model
Ref: (beta) Dynamic Quantization on BERT — PyTorch Tutorials (pytorch.org)
(figure: a BERT encoder, trained from scratch or via transfer learning, feeding a classifier head)
BERT-Base: 12 encoder layers; BERT-Large: 24 encoder layers
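A minimal transfer-learning sketch with the Hugging Face transformers library, assuming an English base model and three target classes (both are placeholders):

```python
from transformers import BertForSequenceClassification

# Load pre-trained BERT weights and attach a fresh classification head
# (a linear layer on top of the [CLS] representation); this corresponds
# to the "transfer learning" path in the figure above.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,  # placeholder: set to your task's number of classes
)
```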
40. BERT
• BERT is well suited to natural language understanding (NLU) tasks, for example:
• Text classification
• Sentiment analysis (a pipeline sketch follows below)
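For instance, a one-line sentiment classifier via the transformers pipeline API (the printed output is illustrative):

```python
from transformers import pipeline

# Downloads a default English sentiment model built on a fine-tuned
# transformer encoder and runs it on raw text.
classifier = pipeline("sentiment-analysis")
print(classifier("This course is great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
```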
41. Exercise
• Try to run this code (a sketch of the underlying ktrain workflow follows below):
ktrain01 (day master class only)
ktrain02 (international master class only)
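The ktrain01/ktrain02 notebooks themselves are not reproduced here; as a hedged sketch of the typical ktrain BERT workflow they follow (toy data, and the hyperparameters are illustrative):

```python
import ktrain
from ktrain import text

# Toy stand-in data; the class notebooks use their own datasets.
train_texts = ["great movie", "awful acting", "loved it", "terrible film"]
train_labels = ["pos", "neg", "pos", "neg"]

(x_train, y_train), (x_val, y_val), preproc = text.texts_from_array(
    x_train=train_texts, y_train=train_labels,
    x_test=train_texts, y_test=train_labels,  # reusing data only for the demo
    class_names=["neg", "pos"], preprocess_mode="bert", maxlen=64)

model = text.text_classifier("bert", train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_val, y_val), batch_size=4)
learner.fit_onecycle(2e-5, 1)  # learning rate commonly used for BERT fine-tuning
```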
42. Homework
• Try to use BERT to build a text classification model
• Prepare a multi-label dataset
Hint:
For multi-label text classification, the amount of data is the key to performance (see the sketch below)
ktrain_multi-labels_bad_performance
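To make the multi-label setup concrete, a hedged Keras sketch (tags and texts are invented for illustration): each text gets a multi-hot label vector, and the output layer uses sigmoid with binary cross-entropy so every label is predicted independently:

```python
import numpy as np
import tensorflow as tf

# Multi-hot labels: one text can carry several tags at once.
texts = ["screen cracked and battery died", "fast shipping, great price"]
labels = np.array([[1, 1, 0],   # tags: screen, battery
                   [0, 0, 1]])  # tag: service

# Sigmoid + binary cross-entropy treats each tag independently;
# softmax would wrongly force the tags to compete for probability mass.
num_tags = labels.shape[1]
head = tf.keras.Sequential([tf.keras.layers.Dense(num_tags, activation="sigmoid")])
head.compile(optimizer="adam", loss="binary_crossentropy")
```

ktrain is documented to switch to this multi-label setup automatically when it detects multi-hot labels; with only a handful of examples per tag, performance will be poor, which is the point the ktrain_multi-labels_bad_performance notebook makes.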