Natural Language Processing (NLP)
and Transformer Models
Ding Li 2021.11
Use 100 ~ 1K dimensions to represent each word
Basic word embedding methods
• Word2vec (Google, 2013)
• GloVe (Stanford, 2014)
• FastText (Facebook, 2016)
Continuous bag-of-words method (CBOW)
• Sliding window to select context words and center word
• Average context words as input to predict center word
• Self-supervised learning, with a massive corpus as training data
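The sliding-window step above can be sketched in plain Python. This is a minimal illustration, not the word2vec implementation: the function names, the window size, and the toy sentence are assumptions chosen for the example.

```python
def cbow_pairs(tokens, half_window=2):
    """Build (context_words, center_word) pairs with a sliding window."""
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        # Context = words on both sides of the center word.
        context = tokens[i - half_window:i] + tokens[i + 1:i + half_window + 1]
        pairs.append((context, tokens[i]))
    return pairs


def average_one_hot(context, vocab):
    """Average the one-hot vectors of the context words (the CBOW input)."""
    index = {word: k for k, word in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for word in context:
        vec[index[word]] += 1.0 / len(context)
    return vec


tokens = "i am happy because i am learning".split()
vocab = sorted(set(tokens))
pairs = cbow_pairs(tokens, half_window=2)
first_input = average_one_hot(pairs[0][0], vocab)
print(pairs[0], first_input)
```

Each pair is one self-supervised training example: the averaged context vector is the input and the center word is the label, so no manual annotation is needed.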
Python code
[Figure: an input word is encoded as a one-hot vector and mapped to its word embedding]
Recurrent Neural Networks (RNN): keep information
Python code
want? response?
Gated Recurrent Units (GRU) help preserve important information
Long Short-Term Memory (LSTM): same purpose
Named Entity Recognition
B: token begins an entity; I: token is inside an entity; O: other
Sharon Floyd flew to Miami on Friday
B-per I-per O O B-geo O B-tim
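A short sketch of how the B/I/O tags above can be turned back into entities. The function name is hypothetical, and the handling of a stray I- tag (treated like O) is an assumption of this example.

```python
def bio_to_entities(tokens, tags):
    """Group tokens into (entity_text, entity_type) pairs from B/I/O tags."""
    entities = []
    words, etype = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if any.
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [token], tag[2:]
        elif tag.startswith("I-") and words and tag[2:] == etype:
            # Continue the current entity.
            words.append(token)
        else:
            # "O" (or an I- tag without a matching B-) ends the current entity.
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [], None
    if words:
        entities.append((" ".join(words), etype))
    return entities


tokens = "Sharon Floyd flew to Miami on Friday".split()
tags = "B-per I-per O O B-geo O B-tim".split()
entities = bio_to_entities(tokens, tags)
print(entities)
```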
Encoder and Decoder Structure
How are the results?
Wie sind die Ergebnisse?
Problem: as sequence size increases, performance decreases
Attention: Word Alignment
information step by
step with
disambiguation and
score it
Encode/Decode Attention: which key word is most relevant to query?
For languages with
different grammar
structures, attention
still looks at the
correct token between
Sampling for next word
Greedy decoding: select the most probable word at each step
Beam search: a broader, more exploratory decoding alternative
Minimum Bayes Risk: compare many samples against each other,
select sample with the highest similarity
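To make the difference between greedy decoding and beam search concrete, here is a minimal sketch on a toy next-word model. The probability table `NEXT` and both function names are hypothetical, invented only for this illustration.

```python
import math

# Hypothetical toy model: P(next word | previous word).
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"dog": 0.55, "cat": 0.45},
    "a": {"bird": 0.9, "fish": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
    "fish": {"</s>": 1.0},
}


def greedy_decode(model, max_len=10):
    """Pick the single most probable next word at every step."""
    seq, logp = ["<s>"], 0.0
    while seq[-1] != "</s>" and len(seq) < max_len:
        word, p = max(model[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(word)
        logp += math.log(p)
    return seq, logp


def beam_search(model, beam_width=2, max_len=10):
    """Keep the beam_width best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((logp, seq))  # finished sequence, keep as is
                continue
            for word, p in model[seq[-1]].items():
                candidates.append((logp + math.log(p), seq + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    best_logp, best_seq = beams[0]
    return best_seq, best_logp


greedy_seq, greedy_logp = greedy_decode(NEXT)
beam_seq, beam_logp = beam_search(NEXT, beam_width=2)
print(greedy_seq, beam_seq)
```

In this toy model greedy decoding commits to "the" because it is locally most probable, while beam search also keeps "a" alive and ends with a sequence of higher overall probability.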
Python code
Info loss
Key (K)
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T)\,V$
Q: linearly transformed from the output
K, V: linearly transformed from the input
RNN: calculation must happen in sequence
Positional Encoding: add positional info to words
Transformer: parallel computing for all words
Multi-headed Attention
Causal Attention (Self-Attention)
• Queries and Keys are words from the same sentence
• Queries should only be allowed to look at preceding words
• Find the words that deserve more attention
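The masking rule in the bullets above can be sketched as follows. This example assumes the usual causal mask, in which a query sees its own position and all earlier ones; the function name and the score matrix are made up for illustration.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row over positions 0..i only; later positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                 # query i may not look ahead
        top = max(visible)                     # subtract max for numerical stability
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)    # future positions are masked out
        weights.append([e / total for e in exps] + masked)
    return weights


# Hypothetical query-key scores for a 3-word sentence.
scores = [[1.0, 2.0, 3.0],
          [1.0, 1.0, 5.0],
          [0.0, 0.0, 0.0]]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```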
linear transformation
• Each head uses different linear transformations to represent words
• Different heads can learn different relationships between words
Transformer Decoder
Python code
Online Summarization Tool
transformers GitHub
Create the query Q, key K, and value V
by multiplying the input matrix X with the weight matrices Wq, Wk, and Wv
Self Attention The meaning of a word can come from other words in sentence:
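A minimal pure-Python sketch of self-attention built from X, Wq, Wk, and Wv. The helper names and the tiny matrices are assumptions for illustration; the division by the square root of the key dimension follows the scaled dot-product form in "Attention Is All You Need".

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Return (output, attention_weights) for one attention head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]           # K transposed
    scores = matmul(q, k_t)                        # scores[i][j] = q_i . k_j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Hypothetical input: 3 words, embedding size 2, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
print(weights)
print(output)
```

Each output row is a weighted mix of all value vectors, which is how the representation of one word picks up meaning from the other words in the sentence.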
BERT (Bidirectional Encoder Representations from Transformers)
Transfer Learning
Pre-training (base model 110M parameters, large model 340M)
Pre-training basic model with massive data
Fine-tuning models for different applications
Masked Language Modeling (MLM)
Next Sentence Prediction (NSP)
The legislators
believed that they
were on the right
side of history.
So they changed the law.
Then the bunny ate the carrot.
Pre-training data
• Books Corpus (800M words)
• English Wikipedia (2,500M words, ~13G)
Fine-tuning and Data Input
The pre-training input format (Sentence A, Sentence B) is reused for each fine-tuning task:
Task | Sentence A | Sentence B | Result
Classification | Text | None | Sentiment pos/neg? Grammar correct?
Question Answering | Question | Passage | Answer or location in passage
Summary | Article | Summary | Summary of the article
Natural Language Inference | Hypothesis | Premise | Entailment, contradiction, neutral?
Named Entity Recognition | Sentence | Entities | Entities and tags
Paraphrase | Sentence | Paraphrase | Paraphrase of the sentence
Natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”.
Bert GitHub Python
Paper 2019
The paper uses the Medical Information Mart
for Intensive Care III (MIMIC-III) dataset.
MIMIC-III consists of the electronic health records
of 58,976 unique hospital admissions from 38,597
patients in the intensive care unit of the Beth
Israel Deaconess Medical Center between 2001
and 2012. There are 2,083,180 de-identified notes
associated with the admissions.
ClinicalBERT accurately predicts 30-day
readmission using discharge summaries.
AUROC: Area under the receiver operating characteristic curve
AUPRC: Area under the precision-recall curve
PR80: Recall at precision of 80%
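The metrics above can be sketched in a few lines. This is an illustrative implementation, not the ClinicalBERT evaluation code: the function names, labels, and scores are hypothetical, and tied scores are not handled specially in `recall_at_precision`.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_precision(labels, scores, min_precision=0.8):
    """Highest recall over all score thresholds whose precision >= min_precision."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    true_pos, best_recall = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label              # predict "positive" for the top-k scores
        if true_pos / k >= min_precision:
            best_recall = max(best_recall, true_pos / total_pos)
    return best_recall


# Hypothetical readmission labels (1 = readmitted) and model scores.
labels = [1, 1, 0, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
roc = auroc(labels, scores)
pr80 = recall_at_precision(labels, scores, min_precision=0.8)
print(roc, pr80)
```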
ClinicalBert paper
BioBert: trained with PubMed abstracts (PubMed) and/or PubMed Central full-text articles (PMC) GitHub
T5 (Text-to-Text Transfer Transformer)
Unified Multi-Task Framework: Text as Input, Text as Output
CoLA: Corpus of Linguistic Acceptability
STSB: Semantic Textual Similarity Benchmark
RTE: Recognizing Textual Entailment
MNLI: Multi-Genre Natural Language Inference
MRPC: Microsoft Research Paraphrase Corpus
SQuAD: Stanford Question Answering Dataset
WMT English to German
COPA: Choice of Plausible Alternatives, causal reasoning
MultiRC: Multi-Sentence Reading Comprehension
WiC: Word in Context
WSC: Winograd Schema Challenge, resolve ambiguity
The city councilmen refused the demonstrators a permit
because they [feared/advocated] violence.
Question: “they” refers to?
Transfer Learning with C4 – Colossal Cleaned Crawl Corpus (~800G), base model with 220M parameters, large model 770M, largest 11B
T5 GitHub Paper 2020 Python
Language Model Meta-Learning
Larger Models Make Increasingly Efficient Use of In-Context Information
Datasets Used to Train GPT-3
Model Size ~ TriviaQA Performance
SAT Analogies (GPT-3: 65%; average college applicant: 57%)
Gu 2021 models
Less domain
More domain
Model | Model Full Name | Vocabulary | Training Size
BERT | bert-base-uncased | Wiki + Books | 16G
RoBERTa | roberta-base | Web Crawl | 160G
PubMedBert | microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext | PubMed | 21G
Self-Alignment Pretraining for Biomedical Entity Representations SapBert GitHub Model
Liu 2021
Figure 1: The t-SNE visualization of UMLS entities under PUBMEDBERT (BERT pretrained on
PubMed papers) & PUBMEDBERT+ SAPBERT (PUBMEDBERT further pretrained on UMLS
synonyms). The biomedical names of different concepts are hard to separate in the
heterogeneous embedding space (left). After the self-alignment pretraining, the same
concept’s entity names are drawn closer to form compact clusters (right).
Pretraining with UMLS (Unified Medical Language System)
4M+ concepts & 10M+ synonyms (MeSH, SNOMED, RxNorm, Gene Ontology, & OMIM)
Hard Pairs Mining (𝑥𝑎, 𝑥𝑝, 𝑥𝑛)
𝑥𝑎: anchor; 𝑥𝑝: positive synonym match; 𝑥𝑛 : negative synonym match
Only consider triplets with the negative sample closer to the positive sample by a margin of λ.
Loss Function
S: similarity matrix among 𝜒𝑏 items in batch b
Negative pair similarity
should be small
Positive pair similarity
should be large
Radford 2021 GitHub
• We demonstrate that the simple pre-training
task of predicting which caption goes with
which image is an efficient and scalable way
to learn SOTA image representations from
scratch on a dataset of 400 million (image,
text) pairs collected from the internet.
• After pre-training, natural language is used to
reference learned visual concepts (or describe
new ones) enabling zero-shot transfer of the
model to downstream tasks.
• We study the performance of this approach by
benchmarking on over 30 different existing
computer vision datasets, spanning tasks such
as OCR, action recognition in videos, geo-
localization, and many types of fine-grained
object classification.
• The model transfers non-trivially to most tasks
and is often competitive with a fully
supervised baseline without the need for any
dataset specific training.
Masked Autoencoders (MAE) Are Scalable Vision Learners He 2021
Figure 1. Our MAE architecture. During pre-training, a large random
subset of image patches (e.g., 75%) is masked out. The encoder is
applied to the small subset of visible patches. Mask tokens are
introduced after the encoder, and the full set of encoded patches and
mask tokens is processed by a small decoder that reconstructs the
original image in pixels. After pre-training, the decoder is discarded,
and the encoder is applied to uncorrupted images to produce
representations for recognition tasks.
Figure 4. Reconstructions of ImageNet validation images using an MAE
pre-trained with a masking ratio of 75% but applied on inputs with
higher masking ratios. The predictions differ plausibly from the original
images, showing that the method can generalize.
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Baevski 2022
 Self-supervised learning turns all human-written text into potential training data for machines.
 Machines learn not only the meaning and semantics of text, but also reasoning.
 Models with billions of parameters are rapidly becoming more capable.
 Coursera
Natural Language Processing Specialization
Applied Text Mining in Python
 Books
Getting Started with Google BERT
 Papers
Attention Is All You Need
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Pretrained Transformers for Text Ranking: BERT and Beyond
 Blogs
Illustrated: Self-Attention
Natural language inference
Keyword Extraction: from TF-IDF to BERT
Understanding searches better than ever before
 Projects
Bert Extractive Summarizer
 Colab
A Visual Notebook to Using BERT for the First Time (blog)
1. Count word frequency in all training tweets
Word in All Tweets | Counts in Positive Tweets | Counts in Negative Tweets
Happy | 305 | 87
Hard | 66 | 217
NLP | 34 | 29
Learning | 18 | 13
2. Sum the frequency for each tweet
Tweet | Counts in Positive Tweets | Counts in Negative Tweets
Happy learning | 323 (305 + 18) | 100 (87 + 13)
NLP hard | 100 (34 + 66) | 246 (29 + 217)
3. Regression and Sigmoid
$z = \theta_0 + \theta_1 X_1 + \theta_2 X_2$
$h(z) = \frac{1}{1 + e^{-z}}$
Update $\theta$ to minimize the difference between h and the label
4. Predict results with optimized parameters
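The four steps can be sketched end to end. The word counts come from the step 1 table; the parameter vector `theta` is a hypothetical stand-in for values that training would produce, and the function names are assumptions of this example.

```python
import math

# Step 1: word -> (count in positive tweets, count in negative tweets).
FREQS = {"happy": (305, 87), "hard": (66, 217), "nlp": (34, 29), "learning": (18, 13)}


def extract_features(tweet):
    """Step 2: sum the positive and negative counts of the words in a tweet."""
    words = tweet.lower().split()
    x1 = sum(FREQS.get(w, (0, 0))[0] for w in words)
    x2 = sum(FREQS.get(w, (0, 0))[1] for w in words)
    return x1, x2


def sigmoid(z):
    """Step 3: squash the regression output into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(tweet, theta):
    """Step 4: probability that the tweet is positive."""
    x1, x2 = extract_features(tweet)
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return sigmoid(z)


theta = (0.0, 0.01, -0.01)  # hypothetical parameters, not trained values
p_happy = predict("Happy learning", theta)
p_hard = predict("NLP hard", theta)
print(extract_features("Happy learning"), p_happy, p_hard)
```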
Python code
Information from single words is partially lost in summation
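1. Bayes’ Rule
$P(\text{Positive} \cap \text{happy}) = P(\text{happy}) \times P(\text{Positive} \mid \text{happy}) = P(\text{Positive}) \times P(\text{happy} \mid \text{Positive})$
$P(\text{Positive} \mid \text{happy}) = P(\text{happy} \mid \text{Positive}) \times \frac{P(\text{Positive})}{P(\text{happy})}$
2. $P(\text{happy} \mid \text{Positive}) = \frac{freq(w_i, Class)}{N_{Class}}$, with $w_i$ = happy and Class = Positive
3. Laplacian Smoothing to handle zero values
$P(w_i \mid Class) = \frac{freq(w_i, Class) + 1}{N_{Class} + V}$
$N_{Class}$: frequency of all words of a class
$V$: number of unique words in vocabulary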
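4. Log Likelihood
$P(\text{happy} \mid \text{Positive}) = \frac{2 + 1}{13 + 8} = 0.14$
$\frac{P(\text{Positive} \mid \text{happy})}{P(\text{Negative} \mid \text{happy})} = \frac{P(\text{Positive})}{P(\text{Negative})} \times \frac{P(\text{happy} \mid \text{Positive})}{P(\text{happy} \mid \text{Negative})}$, where $\frac{P(\text{Positive})}{P(\text{Negative})}$ is the prior ratio
Doc: I am happy learning NLP
$\text{log likelihood} = \log\frac{P(\text{Positive})}{P(\text{Negative})} + \sum_{w \in \text{Doc}} \log\frac{P(w \mid \text{Positive})}{P(w \mid \text{Negative})}$
$= 0 + 0 + 0 + 0.146 + 0 + 0 = 0.146 > 0$ (terms: prior ratio, I, am, happy, learning, NLP), so the document is classified as positive
Python code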
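Word | Count (pos) | Count (neg) | p(w|pos) | p(w|neg)
I | 3 | 3 | 0.19 | 0.19
am | 3 | 3 | 0.19 | 0.19
happy | 2 | 1 | 0.14 | 0.10
because | 1 | 0 | 0.10 | 0.05
learning | 1 | 1 | 0.10 | 0.10
NLP | 1 | 1 | 0.10 | 0.10
sad | 1 | 2 | 0.10 | 0.14
not | 1 | 2 | 0.10 | 0.14
Nclass | 13 | 13 | |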
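The table and the log-likelihood score can be reproduced with a short sketch. The counts are taken from the table; the function names are hypothetical, and base-10 logarithms are an assumption that matches the 0.146 figure on the slide. Note that with unrounded probabilities the score is about 0.176; 0.146 results from using the rounded table values 0.14 and 0.10.

```python
import math

# word -> (count in positive tweets, count in negative tweets), from the table.
COUNTS = {
    "i": (3, 3), "am": (3, 3), "happy": (2, 1), "because": (1, 0),
    "learning": (1, 1), "nlp": (1, 1), "sad": (1, 2), "not": (1, 2),
}
V = len(COUNTS)                                  # unique words in vocabulary
N_POS = sum(pos for pos, _ in COUNTS.values())   # all words in positive class
N_NEG = sum(neg for _, neg in COUNTS.values())   # all words in negative class


def smoothed(word):
    """Laplacian-smoothed P(word | pos) and P(word | neg)."""
    pos, neg = COUNTS[word]
    return (pos + 1) / (N_POS + V), (neg + 1) / (N_NEG + V)


def log_likelihood(doc, log_prior_ratio=0.0):
    """Log prior ratio plus the sum of per-word log probability ratios."""
    score = log_prior_ratio
    for word in doc.lower().split():
        if word in COUNTS:                       # unknown words are skipped
            p_pos, p_neg = smoothed(word)
            score += math.log10(p_pos / p_neg)
    return score


score = log_likelihood("I am happy learning NLP")
print(round(smoothed("happy")[0], 2), round(score, 3), "positive" if score > 0 else "negative")
```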
Word distributions are calculated without context
N-Gram and Probability [Corpus: I am happy because I am learning]
• Unigram: {I, am, happy, because, learning} $P(I) = \frac{C(I)}{N} = \frac{2}{7}$
• Bigram: {I am, am happy, happy because…} $P(\text{happy} \mid \text{am}) = \frac{C(\text{am happy})}{C(\text{am})}$
• Trigram: {I am happy, am happy because…} $P(\text{happy} \mid \text{I am}) = \frac{C(\text{I am happy})}{C(\text{I am})}$
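The count ratios above can be checked on the same corpus. A minimal sketch; the helper names are assumptions of this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def conditional_probability(tokens, context, word):
    """P(word | context) = C(context + word) / C(context)."""
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n - 1)[tuple(context)]
    return numerator / denominator if denominator else 0.0


corpus = "I am happy because I am learning".split()
p_i = ngram_counts(corpus, 1)[("I",)] / len(corpus)
p_happy_given_am = conditional_probability(corpus, ["am"], "happy")
p_happy_given_i_am = conditional_probability(corpus, ["I", "am"], "happy")
print(p_i, p_happy_given_am, p_happy_given_i_am)
```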
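Approximation of Sequence Probability
• Use N-Gram for approximation since long sequences are rare
Use Bigram: $P(\text{the teacher drinks tea}) \approx P(\text{the})\,P(\text{teacher} \mid \text{the})\,P(\text{drinks} \mid \text{teacher})\,P(\text{tea} \mid \text{drinks})$
• Interpolation to handle missing terms
Trigram: $P(\text{tea} \mid \text{teacher drinks}) \approx 0.7 \times P(\text{tea} \mid \text{teacher drinks}) + 0.2 \times P(\text{tea} \mid \text{drinks}) + 0.1 \times P(\text{tea})$
• Add start and end token of sentence: <s> the teacher drinks tea </s>
$P(\text{the teacher drinks tea}) \approx P(\text{the} \mid \text{<s>})\,P(\text{teacher} \mid \text{the})\,P(\text{drinks} \mid \text{teacher})\,P(\text{tea} \mid \text{drinks})\,P(\text{</s>} \mid \text{tea})$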
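A sketch of the bigram approximation with start and end tokens, plus the interpolation formula. The three training sentences are hypothetical; the weights 0.7, 0.2, and 0.1 are the ones from the slide.

```python
from collections import Counter

# Hypothetical training corpus, one sentence per entry.
SENTENCES = [
    "the teacher drinks tea",
    "the teacher drinks coffee",
    "the student drinks tea",
]


def train_bigrams(sentences):
    """Count bigrams and their left-hand contexts, with <s> and </s> added."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])               # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams):
    """Multiply P(word | previous word) over the whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob


def interpolate(p_trigram, p_bigram, p_unigram, weights=(0.7, 0.2, 0.1)):
    """Back up a possibly missing trigram estimate with lower-order ones."""
    w3, w2, w1 = weights
    return w3 * p_trigram + w2 * p_bigram + w1 * p_unigram


contexts, bigrams = train_bigrams(SENTENCES)
p_sentence = sentence_probability("the teacher drinks tea", contexts, bigrams)
p_interp = interpolate(0.0, 0.5, 0.1)   # trigram unseen, so its estimate is 0
print(p_sentence, p_interp)
```

Even when the trigram was never observed, the interpolated estimate stays above zero, which keeps the product for the whole sentence from collapsing to zero.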
Probability Matrix [Corpus: I study I learn]
Auto Complete
Generative Text
Python code
TF-IDF (Term Frequency – Inverse Document Frequency)
tf = frequency of a term in a document
$idf = \log\frac{N_{all}}{N_t}$
$tf\text{-}idf = tf \times idf = tf \times \log\frac{N_{all}}{N_t}$
Wikipedia TF-IDF Dataset Release
Nall: total articles
𝑁𝑡: articles with term t
Term | Nt | Nall | idf
the | 5,457,533 | 5,989,879 | 0.09
disease | 67,085 | 5,989,879 | 4.49
encephalitis | 904 | 5,989,879 | 8.80
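The idf column can be reproduced from the article counts in the table. The natural logarithm is an assumption here, chosen because it matches the tabulated values; the function names and the term frequency in the usage line are hypothetical.

```python
import math

N_ALL = 5_989_879  # total articles
ARTICLES_WITH_TERM = {"the": 5_457_533, "disease": 67_085, "encephalitis": 904}


def idf(term):
    """Inverse document frequency: log(N_all / N_t)."""
    return math.log(N_ALL / ARTICLES_WITH_TERM[term])


def tf_idf(term_frequency, term):
    """Term frequency in one document times the corpus-level idf."""
    return term_frequency * idf(term)


for term in ARTICLES_WITH_TERM:
    print(term, round(idf(term), 2))

# Hypothetical document in which "encephalitis" appears 3 times.
score = tf_idf(3, "encephalitis")
```

A rare term such as "encephalitis" gets a far larger weight than "the", even if "the" occurs many more times in the document.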
TextRank (based on graph of co-occurrence words)
Important words are
surrounded by other
important words.
Word distance: 2 ~ 10
Similar to PageRank
Python Lib: Summa
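A minimal sketch of the idea, not the Summa implementation: build an undirected co-occurrence graph and run PageRank-style updates on it. The window size, damping factor, iteration count, and token list are assumptions for illustration.

```python
from collections import defaultdict


def build_graph(words, window=2):
    """Connect words that appear within `window` positions of each other."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != word:
                graph[word].add(words[j])
                graph[words[j]].add(word)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """A word is important if its neighbours are important (PageRank update)."""
    nodes = list(graph)
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_score = {}
        for node in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            rank = sum(score[m] / len(graph[m]) for m in graph[node])
            new_score[node] = (1 - damping) / len(nodes) + damping * rank
        score = new_score
    return score


# Hypothetical token list with stop words already removed.
words = ["graph", "ranking", "keyword", "graph", "extraction", "keyword", "graph"]
scores = textrank(build_graph(words, window=2))
for word, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(value, 3))
```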
YAKE (Yet Another Keyword Extractor)
Paper published in 2020
Jellyfish package is used to calculate word distance
KeyBERT (Keyword Extraction with BERT)
SentenceTransformer: word embedding for article and keywords
Supported Pretrained Models:
• stsb-roberta-large 1.31G
• nli-roberta-large 1.31G
• distilbert-base-nli-mean-tokens 244M
0 1 2 3 4 5 6 … 1023
Article 1.35 0.98 -0.34 0.94 -0.17 1.38 -0.07 … 1.09
Key 1 0.04 -0.22 -0.87 0.92 0.82 1.15 0.14 … 1.71
Then compute the similarity between the article and keywords
Sentence meaning can
be pooled from:
• [CLS]
• Mean of all words
• Max of all words
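The pooling options and the similarity step can be sketched without any model. The 3-dimensional vectors below are hypothetical stand-ins for real embeddings, and cosine similarity is assumed as the similarity measure.

```python
import math


def mean_pool(token_vectors):
    """Sentence vector = element-wise mean of the token vectors."""
    return [sum(col) / len(token_vectors) for col in zip(*token_vectors)]


def max_pool(token_vectors):
    """Sentence vector = element-wise maximum of the token vectors."""
    return [max(col) for col in zip(*token_vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical embeddings (real models use hundreds of dimensions).
article = [1.0, 0.0, 1.0]
keywords = {"disease": [0.9, 0.1, 0.8], "weather": [-0.5, 1.0, 0.0]}

ranked = sorted(keywords, key=lambda k: cosine_similarity(article, keywords[k]), reverse=True)
print(ranked)
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]), max_pool([[1.0, 2.0], [3.0, 4.0]]))
```

Candidate keywords whose vectors point in nearly the same direction as the article vector are ranked first.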
 File Operation
f = open(filename, mode) f.close()
f.readline() f.write(message)
for line in f: do_something(line)
df = pd.read_csv(filename) df.to_csv(filename)
 Extract Text from HTML File
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(html_str, 'html.parser')
html_text = html_soup.get_text()
 Contraction Expansion
“can’t” → “cannot”; “We’re” → “We are”
Regular expression pattern substitution
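One possible way to do the substitution, assuming a small lookup table: the dictionary below contains only the two contractions shown above, and the function name is hypothetical. A real application would need a much longer list.

```python
import re

# Illustrative mapping; extend as needed.
CONTRACTIONS = {"can't": "cannot", "we're": "we are"}

# One pattern that matches any key, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction, preserving a leading capital letter."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes first

    def replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        return expanded.capitalize() if word[0].isupper() else expanded

    return PATTERN.sub(replace, text)


print(expand_contractions("We\u2019re sure you can\u2019t lose"))
```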
 Word Comparison
s.startswith(t) s.endswith(t)
t in s
s.isupper() s.islower() s.istitle()
s.isalpha() s.isdigit() s.isalnum()
 String Operations
s.lower() s.upper() s.title()
s.split(t) s.splitlines() s.join(t)
s.strip() s.rstrip()
s.find(t) s.rfind(t) s.replace(u,v)
 Regular Expression
import re
Remove punctuation: re.sub(r'[^\w\s]', '', s) \w: word characters, \s: whitespace
Find callouts: re.findall(r'@[A-Za-z0-9_]+', s) or re.findall(r'@[\w]+', s)
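A runnable version of the two patterns on a made-up tweet. The sample string is hypothetical, and `re.findall` is assumed as the call used to find callouts.

```python
import re

s = "Thanks @alice and @bob_42, great talk!!!"

# Remove everything that is neither a word character nor whitespace.
no_punct = re.sub(r"[^\w\s]", "", s)

# Find callouts (mentions): "@" followed by letters, digits, or underscores.
mentions = re.findall(r"@[A-Za-z0-9_]+", s)

print(no_punct)
print(mentions)
```

The underscore counts as a word character, so it survives punctuation removal.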
 Remove Stop Words [NLTK: Natural Language Toolkit]
from nltk.corpus import stopwords
stop = stopwords.words('english')
" ".join(x for x in s.split() if x not in stop)
 Tokenization
 Stemming
“fish”, “fishing”, “fishes” → “fish”, “leaves” → “leav”
porter = nltk.PorterStemmer()
 Lemmatization
“good”, “better”, “best” → “good”, “leaves” → “leaf”
lemma = nltk.WordNetLemmatizer()
 Part of Speech (POS) Tagging
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import torch
import transformers as ppb

# Load a tab-separated file: column 0 holds the sentence, column 1 the label.
# The file path is left empty in the source.
df = pd.read_csv('', delimiter='\t', header=None)

# Load the pretrained BERT model and its tokenizer.
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

# Tokenize each sentence, adding the [CLS] and [SEP] special tokens.
tokenized = df[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

# Pad all sentences to the same length with 0 and mask out the padding.
max_len = max(len(x) for x in tokenized.values)
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
input_ids = torch.tensor(padded).to(torch.int64)
attention_mask = torch.tensor(attention_mask).to(torch.int64)

# Run BERT without tracking gradients.
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

# Use the final hidden state of the [CLS] token (position 0) as the sentence embedding.
features = last_hidden_states[0][:, 0, :].numpy()
labels = df[1]

# Train a logistic regression classifier on the embeddings and report test accuracy.
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)
Sentence embedding in [CLS]
Logistic regression is applied to the 768 embedding values of each
sentence to decide its sentiment classification.
Result: 0.86
Context-based Embedding
Sentence A: He got bit by Python.
Sentence B: Python is my favorite programming language.
BERT Configurations
L (# of encoders) A (attention heads) H (hidden units)
Bert-base 12 12 768
Bert-large 24 16 1024
BERT uses Wordpiece Tokenizer
"Let us start pretraining the model."
tokens = [let, us, start, pre, ##train, ##ing, the, model]
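The split of "pretraining" into pre, ##train, ##ing can be reproduced with a greedy longest-match-first lookup. This is a simplified sketch, not BERT's tokenizer: the tiny vocabulary contains only the pieces from the example above, and the function names are hypothetical.

```python
import re

# Toy vocabulary; the real bert-base-uncased vocabulary has about 30,000 entries.
VOCAB = {"let", "us", "start", "pre", "##train", "##ing", "the", "model"}


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces get a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter substring
        if piece is None:
            return [unk]                       # no piece fits: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(sentence, vocab=VOCAB):
    """Lowercase, split into words, then split each word into word pieces."""
    words = re.findall(r"\w+", sentence.lower())
    return [piece for word in words for piece in wordpiece_tokenize(word, vocab)]


tokens = tokenize("Let us start pretraining the model.")
print(tokens)
```

Splitting rare words into known pieces keeps the vocabulary small while still representing words the model has never seen as a whole.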
Masked Language Model
The feedforward network
takes representation of masked
token as input and returns the
probability of all the words in
our vocabulary to be the
masked word
Sentiment Analysis Natural Language Inference Named Entity Recognition
Hugging Face transformers documentation
Paragraph = "The immune system is a system of many
biological structures and processes within an organism
that protects against disease. To function properly, an
immune system must detect a wide variety of agents,
known as pathogens, from viruses to parasitic worms, and
distinguish them from the organism's own healthy tissue."
Question = "What is the immune system?"
Answer = "a system of many biological structures and
processes within an organism that protects against disease"
Extractive summarization
• Pick important sentences from a text.
• Add [CLS] to represent each sentence and judge
whether the sentence should be included.
Abstractive summarization
• Paraphrasing the given text and holding
essential meaning.
Fine-tune BERT for Extractive Summarization Text Summarization with Pretrained Encoders
The final representation of the [CLS] token is fed to a fully-connected layer that
produces the relevance score s of that text with respect to the query.

