National Taipei University of Nursing and Health Sciences (NTUNHS)
BERT
Orozco Hsu
2022-06-06
1
About me
• Education
• NCU (MIS), NCCU (CS)
• Work Experience
• Telecom big data Innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data / ML / AIoT / AI columnist
2
Tutorial
Content
3
• Seq2seq introduction
• Attention layer
• Transformer model
• BERT
• Code
• Homework
• Download code
• https://github.com/orozcohsu/ntunhs_2022_01.git
• Folder/file
• 20220523_inter_master/run.ipynb
4
Code
5
• Click the button
• Open it with Colab
• Copy it to your Google Drive
• Check your Google Drive
Seq2seq
• Seq2seq is a family of machine learning approaches used for natural
language processing (NLP)
• Applications include
• language translation
• image captioning
• conversational models
• text summarization
6
Seq2seq
• Seq2seq turns one sequence into another sequence (sequence transformation)
• The primary components are one encoder and one decoder network
• It often applies a recurrent neural network (RNN), or more often an LSTM or GRU, to avoid the vanishing-gradient problem
7
Seq2seq
• The context for each item is the output from the previous step
• The encoder turns each sentence into a corresponding hidden vector containing its context
• The decoder reverses the process, turning the vector into an output item, using the previous output as the input context
8
Ref: https://en.wikipedia.org/wiki/Seq2seq
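A minimal sketch of this encoder/decoder idea, using PyTorch GRUs; the vocabulary size, dimensions, and the start-of-sentence token id are placeholder assumptions, not values from the course notebook:

```python
# Minimal seq2seq sketch with PyTorch GRUs (illustrative only).
# Vocabulary sizes, dimensions, and tensor shapes are placeholder assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden                   # hidden acts as the "context" vector

class Decoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, hidden):       # prev_token: (batch, 1)
        output, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.out(output), hidden          # logits over the target vocabulary

# One decoding step: the encoder's final hidden state initializes the decoder,
# and each generated token is fed back as the next input.
enc, dec = Encoder(), Decoder()
src = torch.randint(0, 1000, (2, 7))             # two source sentences of length 7
_, context = enc(src)
prev = torch.zeros(2, 1, dtype=torch.long)       # assumed <sos> token id = 0
logits, context = dec(prev, context)
print(logits.shape)                              # torch.Size([2, 1, 1000])
```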
Seq2seq
(encoder/decoder architecture diagrams)
Attention for Seq2seq
• Attention tremendously improves the Seq2seq model
• With attention, the Seq2seq model does not forget the input context
• With attention, the decoder knows which part of the context to focus on (sketched below)
• The trade-off is that attention consumes a lot of computation
11
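A rough sketch of dot-product attention over the encoder outputs; Luong-style scoring is assumed here, and the shapes are made up for illustration:

```python
# Sketch of dot-product attention for seq2seq, with made-up shapes.
# At each decoding step the decoder state scores every encoder state.
import torch
import torch.nn.functional as F

batch, src_len, hid = 2, 7, 128
encoder_outputs = torch.randn(batch, src_len, hid)   # h_1 ... h_m from the encoder
decoder_state   = torch.randn(batch, hid)            # s_t at the current step

scores  = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
weights = F.softmax(scores, dim=1)                    # which inputs to "notice"
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)        # (batch, hid)

print(weights.shape, context.shape)                   # torch.Size([2, 7]) torch.Size([2, 128])
# One score per (decoder step, encoder position) pair is computed, which is why
# attention consumes a lot of computation on long sequences.
```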
Attention for Seq2seq
(step-by-step attention diagrams)
Ref: Attention and Augmented Recurrent Neural Networks (distill.pub)
Attention is all you need
22
Ref: 1706.03762.pdf (arxiv.org)
Published in 2017
Transformer model
• The Transformer is a kind of seq2seq model (encoder/decoder); see the code sketch below
• The Transformer is NOT an RNN
• The Transformer has only attention and dense layers (no RNN)
• Very good performance in language translation
• Transformer
23
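PyTorch's built-in nn.Transformer can illustrate this encoder/decoder structure built only from attention and dense (feed-forward) layers; the hyperparameters and tensor shapes below are arbitrary placeholder choices:

```python
# The Transformer in PyTorch: encoder/decoder built from attention + dense layers,
# no recurrence. Shapes and hyperparameters are placeholder assumptions.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 2, 512)   # (source length, batch, d_model)
tgt = torch.randn(9, 2, 512)    # (target length, batch, d_model)
out = model(src, tgt)
print(out.shape)                # torch.Size([9, 2, 512])
```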
Transformer model
24
Multi-head: self-attention layers
Self-Attention Layer
(step-by-step self-attention diagrams)
Self-Attention Layer
• Inputs: X = [x_1, x_2, …, x_m]
• Parameters: W_Q, W_K, W_V (the three matrices do not share parameters)
28
This is called single-head self-attention
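A small numeric sketch of this single-head computation with the slide's notation; the dimensions and random inputs are illustrative assumptions:

```python
# Single-head self-attention: Q = X W_Q, K = X W_K, V = X W_V,
# output = softmax(Q K^T / sqrt(d)) V. Dimensions are small illustrative choices.
import numpy as np

m, d_in, d = 4, 8, 16                  # m tokens, input dim, attention dim
rng = np.random.default_rng(0)
X = rng.normal(size=(m, d_in))         # inputs x_1 ... x_m
W_Q, W_K, W_V = (rng.normal(size=(d_in, d)) for _ in range(3))  # not shared

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d)          # (m, m): every token attends to every token
scores -= scores.max(axis=1, keepdims=True)                      # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
output = weights @ V                   # (m, d): same number of tokens in and out

print(output.shape)                    # (4, 16)
```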
Self-Attention Layer
29
Fully connected layer
Self-Attention Layer
30
Stack one more
Transformer model
31
Multi-head self-attention
(same input X; the outputs of the individual self-attention heads are concatenated)
Each word has a 512-dim word embedding
The input and the output have the same 512 dimensions
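The same idea with PyTorch's nn.MultiheadAttention, assuming 8 heads and the 512-dim embeddings mentioned above; batch size and sequence length are placeholders:

```python
# Multi-head self-attention: the same input X is used as query, key, and value;
# the 8 head outputs are concatenated (and projected) back to 512 dimensions.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
X = torch.randn(2, 10, 512)          # (batch, sequence length, 512-dim embeddings)
out, weights = attn(X, X, X)         # self-attention: query = key = value = X
print(out.shape)                     # torch.Size([2, 10, 512]) -> same shape as the input
```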
Transformer model
32
One decoder block
From the encoder output
Transformer model
33
Encoder and Decoder
Transformer model
• Language translation
• Google translation
• http://www.manythings.org/anki/
• Auto text generation
• Demo – InferKit
34
BERT
• Bidirectional Encoder Representations from Transformers (BERT)
• BERT = the encoder of the Transformer
• It learns from a large amount of text without annotation
• For Chinese inputs, using characters as tokens works better than using words (see the tokenizer example below)
37
Encoder
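A quick way to see this character-level behaviour, assuming the Hugging Face transformers library and the public bert-base-chinese checkpoint (the vocabulary is downloaded on first use):

```python
# Character-level tokenization for Chinese BERT, using Hugging Face transformers
# and the public bert-base-chinese checkpoint (requires internet access once).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
tokens = tokenizer.tokenize("今天天氣很好")
print(tokens)   # each Chinese character becomes its own token, e.g. ['今', '天', '天', '氣', '很', '好']
```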
BERT
38
Google’s BERT - NLP and Transformer Architecture That Are Reshaping AI Landscape (neptune.ai)
RNN based
Transformer model
39
(beta) Dynamic Quantization on BERT — PyTorch Tutorials 1.11.0+cu102 documentation
Small BERT: 24 layers
Big BERT: 48 layers
BERT
(from scratch or transfer learning)
(A classifier)
BERT
• BERT is well suited to NLU (natural language understanding) tasks, e.g. (see the ktrain example below)
• Text classification
• Sentiment analysis
40
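A minimal ktrain sketch for BERT text classification (ktrain is the library used in the course notebooks); the toy sentences, labels, and hyperparameters are placeholder assumptions, and the pretrained BERT weights are downloaded on first use:

```python
# Minimal BERT text-classification sketch with ktrain (toy sentiment data).
import ktrain
from ktrain import text

x_train = ["great movie", "terrible movie", "loved it", "hated it"]
y_train = [1, 0, 1, 0]                           # 1 = pos, 0 = neg

(trn, val, preproc) = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_train, y_test=y_train,              # reusing the data only to keep the sketch short
    class_names=["neg", "pos"],
    preprocess_mode="bert", maxlen=32)

model = text.text_classifier("bert", train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=2)
learner.fit_onecycle(2e-5, 1)                    # one epoch, BERT-style learning rate

predictor = ktrain.get_predictor(learner.model, preproc)
print(predictor.predict("what a wonderful film"))
```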
Exercise
• Try to run this code
41
ktrain01 (Day master only)
ktrain02 (International master class only)
Homework
• Try to use BERT to build a text classification model
• Prepare a multi-label dataset
42
Hint:
For multi-label text classification, the amount of data is the key to performance
ktrain_multi-labels_bad_performance
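A hedged sketch of one possible multi-label setup with ktrain's texts_from_df: each text row can switch on several one-hot label columns at once, and ktrain infers the multi-label case from them. The tiny DataFrame and hyperparameters below are placeholders, not a real dataset:

```python
# Multi-label text classification sketch with ktrain (placeholder data).
import pandas as pd
import ktrain
from ktrain import text

df = pd.DataFrame({
    "text":   ["cheap flights to tokyo", "best pizza recipe", "travel on a food budget"],
    "travel": [1, 0, 1],                 # a row may belong to several labels at once
    "food":   [0, 1, 1],
})

(trn, val, preproc) = text.texts_from_df(
    df, "text", label_columns=["travel", "food"],
    val_df=df,                           # reusing the toy data only to keep the sketch short
    preprocess_mode="bert", maxlen=32)

model = text.text_classifier("bert", train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=2)
learner.fit_onecycle(2e-5, 1)
```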
