This document introduces and surveys sequence-to-sequence (seq2seq) models, attention mechanisms, the Transformer architecture, and BERT for natural language processing. It discusses applications of seq2seq models such as language translation and text summarization. Key topics include the encoder-decoder architecture of seq2seq models, how attention improves seq2seq by letting the model focus on the relevant parts of the context, and the Transformer architecture, which uses self-attention rather than recurrent layers. BERT is introduced as a bidirectional Transformer encoder, pre-trained on large amounts of unlabeled text, that achieves state-of-the-art results across a range of NLP tasks. Code examples and homework suggestions are also provided.
2. About me
• Education
• NCU (MIS), NCCU (CS)
• Work Experience
• Telecom big data innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data / ML / AIoT / AI columnist
6. Seq2seq
• Seq2seq is a family of machine learning approaches used for natural language processing (NLP)
• Applications include
• language translation
• image captioning
• conversational models
• text summarization
7. Seq2seq
• Seq2seq turns one sequence into another sequence (sequence transformation)
• The primary components are one encoder and one decoder network
• It typically applies a recurrent neural network (RNN), more often an LSTM or GRU, to avoid the vanishing gradient problem; a minimal encoder sketch follows below
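The slides do not include code here; as a rough illustration only, a minimal GRU-based encoder might look like the following PyTorch sketch (class and parameter names are illustrative, not from the course material):

```python
import torch
import torch.nn as nn

class Seq2SeqEncoder(nn.Module):
    """Minimal seq2seq encoder: embed the input tokens, run a GRU,
    and return the final hidden state as the context vector."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) integer token ids
        embedded = self.embed(src_ids)        # (batch, src_len, emb_dim)
        outputs, hidden = self.gru(embedded)  # hidden: (1, batch, hidden_dim)
        return outputs, hidden                # hidden serves as the context
```

A GRU (or LSTM) is used instead of a vanilla RNN because its gating mitigates vanishing gradients over long sequences.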
8. Seq2seq
• The context for each item is the output from the previous step
• The encoder turns each input sequence into a corresponding hidden vector containing its context
• The decoder reverses the process, turning the vector into an output item, using the previous output as the input context (see the decoding sketch below)
Ref: https://en.wikipedia.org/wiki/Seq2seq
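To make the decoding loop concrete, here is a hedged PyTorch sketch of a matching decoder that feeds each predicted token back in as the next input (greedy decoding; all names are illustrative and assume the Seq2SeqEncoder sketch above):

```python
import torch
import torch.nn as nn

class Seq2SeqDecoder(nn.Module):
    """Minimal seq2seq decoder: the previous output token is fed back
    in as the next input, conditioned on the encoder's context."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):
        # prev_token: (batch, 1); hidden: (1, batch, hidden_dim)
        output, hidden = self.gru(self.embed(prev_token), hidden)
        return self.out(output.squeeze(1)), hidden  # logits over the vocab

def greedy_decode(decoder, context, sos_id, eos_id, max_len=20):
    """Start from <sos> and repeatedly feed the argmax token back in."""
    token = torch.full((context.size(1), 1), sos_id, dtype=torch.long)
    hidden, outputs = context, []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1, keepdim=True)
        outputs.append(token)
        if (token == eos_id).all():
            break
    return torch.cat(outputs, dim=1)
```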
11. Attention for Seq2seq
• Attention tremendously improves the seq2seq model
• With attention, the seq2seq model no longer forgets the input context
• With attention, the decoder knows which parts of the context to focus on
• The trade-off is a large amount of extra computation (see the sketch below)
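As an illustration of the mechanism (not code from the slides), a bare dot-product attention step can be sketched in PyTorch as follows; at each decoding step it scores every encoder position, which is exactly where the extra computation comes from:

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_outputs):
    """decoder_state: (batch, hidden); encoder_outputs: (batch, src_len, hidden).
    Scores each encoder position against the decoder state, turns the
    scores into weights, and returns the weighted sum as the context."""
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2))  # (batch, src_len, 1)
    weights = F.softmax(scores.squeeze(2), dim=1)                    # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs)       # (batch, 1, hidden)
    return context.squeeze(1), weights
```

The weights tell the decoder which input positions to notice; computing them for every decoder step over every encoder position is what makes attention expensive.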
22. Attention is all you need
Ref: Vaswani et al., "Attention Is All You Need", https://arxiv.org/abs/1706.03762 (published 2017)
23. Transformer model
• The Transformer is a kind of seq2seq model (encoder/decoder)
• The Transformer is NOT an RNN
• The Transformer has only attention and dense (feed-forward) layers, with no recurrence
• It achieves very good performance in language translation; the core self-attention operation is sketched below
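A hedged sketch of the scaled dot-product self-attention at the heart of the Transformer, following Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V from the paper cited above (the projection matrices are plain tensors here for brevity):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k).
    Every position attends to every other position in one matrix
    product, so no recurrence is needed: this is the Transformer's
    key departure from RNN-based seq2seq."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    return F.softmax(scores, dim=-1) @ v
```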
37. BERT
• Bidirectional Encoder Representations from Transformers (BERT)
• BERT = the encoder of the Transformer
• It learns from a large amount of text without annotation
• For Chinese inputs, character-level tokens work better than a word vocabulary (see the tokenizer sketch below)
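As a quick, hedged demonstration of the character-level point using the Hugging Face transformers library (not part of the original slides):

```python
from transformers import BertTokenizer

# bert-base-chinese splits Chinese text into individual characters,
# matching the slide's point that characters beat a word vocabulary here.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
print(tokenizer.tokenize("深度學習"))  # ['深', '度', '學', '習']
```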
38. BERT
Ref: Google's BERT - NLP and Transformer Architecture That Are Reshaping AI Landscape (neptune.ai)
39. Transformer model
Ref: (beta) Dynamic Quantization on BERT — PyTorch Tutorials (pytorch.org)
(figure: a BERT encoder, trained from scratch or via transfer learning, feeding a classifier head)
BERT-Base: 12 encoder layers; BERT-Large: 24 encoder layers
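A minimal transfer-learning sketch with the Hugging Face transformers library, assuming an English base model and three target classes (both are placeholders):

```python
from transformers import BertForSequenceClassification

# Load pre-trained BERT weights and attach a fresh classification head
# (a linear layer on top of the [CLS] representation); this corresponds
# to the "transfer learning" path in the figure above.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,  # placeholder: set to your task's number of classes
)
```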
40. BERT
• BERT is well suited to natural language understanding (NLU) tasks, for example:
• Text classification
• Sentiment analysis (a pipeline sketch follows below)
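For instance, a one-line sentiment classifier via the transformers pipeline API (the printed output is illustrative):

```python
from transformers import pipeline

# Downloads a default English sentiment model built on a fine-tuned
# transformer encoder and runs it on raw text.
classifier = pipeline("sentiment-analysis")
print(classifier("This course is great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
```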
41. Exercise
• Try to run this code (a sketch of the underlying ktrain workflow follows below):
ktrain01 (day master class only)
ktrain02 (international master class only)
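The ktrain01/ktrain02 notebooks themselves are not reproduced here; as a hedged sketch of the typical ktrain BERT workflow they follow (toy data, and the hyperparameters are illustrative):

```python
import ktrain
from ktrain import text

# Toy stand-in data; the class notebooks use their own datasets.
train_texts = ["great movie", "awful acting", "loved it", "terrible film"]
train_labels = ["pos", "neg", "pos", "neg"]

(x_train, y_train), (x_val, y_val), preproc = text.texts_from_array(
    x_train=train_texts, y_train=train_labels,
    x_test=train_texts, y_test=train_labels,  # reusing data only for the demo
    class_names=["neg", "pos"], preprocess_mode="bert", maxlen=64)

model = text.text_classifier("bert", train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_val, y_val), batch_size=4)
learner.fit_onecycle(2e-5, 1)  # learning rate commonly used for BERT fine-tuning
```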
42. Homework
• Try to use BERT to build a text classification model
• Prepare a multi-label dataset
Hint:
For multi-label text classification, the amount of data is the key to performance (see the sketch below)
ktrain_multi-labels_bad_performance
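To make the multi-label setup concrete, a hedged Keras sketch (tags and texts are invented for illustration): each text gets a multi-hot label vector, and the output layer uses sigmoid with binary cross-entropy so every label is predicted independently:

```python
import numpy as np
import tensorflow as tf

# Multi-hot labels: one text can carry several tags at once.
texts = ["screen cracked and battery died", "fast shipping, great price"]
labels = np.array([[1, 1, 0],   # tags: screen, battery
                   [0, 0, 1]])  # tag: service

# Sigmoid + binary cross-entropy treats each tag independently;
# softmax would wrongly force the tags to compete for probability mass.
num_tags = labels.shape[1]
head = tf.keras.Sequential([tf.keras.layers.Dense(num_tags, activation="sigmoid")])
head.compile(optimizer="adam", loss="binary_crossentropy")
```

ktrain is documented to switch to this multi-label setup automatically when it detects multi-hot labels; with only a handful of examples per tag, performance will be poor, which is the point the ktrain_multi-labels_bad_performance notebook makes.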