Natural Language Processing (NLP)
and Transformer Models
Ding Li 2021.11
2
Use 100 ~ 1K dimensions to represent each word
Basic word embedding methods
• Word2vec (Google, 2013)
• GloVe (Stanford, 2014)
• FastText (Facebook, 2016)
Continuous bag-of-words method (CBOW)
• Sliding window to select context words and center word
• Average context words as input to predict center word
• Self-supervised learning, a massive corpus as training data (see the CBOW sketch below)
Python code
[Figure: an input word ("puppy") as a one-hot vector of length 1M (the vocabulary size) is mapped to a ~100-dimension word embedding, e.g. (0.98, 0.57, -0.31, …, 1.62); the embedding matrix has shape one-hot vector size × embedding dimensions]
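As a rough illustration of the CBOW idea above (sliding window, averaging context words to predict the center word), the sketch below trains embeddings with gensim's Word2Vec in CBOW mode; the toy corpus and hyperparameters are made up and are not the deck's linked notebook.

```python
# Minimal CBOW sketch (assumes gensim is installed); the toy corpus and
# hyperparameters are illustrative, not the deck's actual setup.
from gensim.models import Word2Vec

corpus = [
    ["the", "puppy", "chased", "the", "ball"],
    ["a", "happy", "puppy", "is", "learning", "tricks"],
]

# sg=0 selects CBOW: average the context words inside the sliding window
# (window=2) and predict the center word; no labels are needed (self-supervised).
model = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=1, sg=0)

print(model.wv["puppy"].shape)               # (100,) dense vector instead of a 1M-long one-hot
print(model.wv.most_similar("puppy", topn=2))
```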
3
Recurrent Neural Networks (RNN): keep information across time steps
Python code
GRUs (Gated Recurrent Units) help to preserve important information
Long Short-Term Memory (LSTM): same purpose
Named Entity Recognition
B: token begins an entity; I: token is inside an entity; O: others
Sharon Floyd flew to Miami on Friday
B-per I-per O O B-geo O B-tim
4
Encoder and Decoder Structure
encoder
decoder
How are the results?
Wie sind die Ergebnisse?
Problem: as sequence size increases, performance decreases
Attention: Word Alignment
Bottleneck: information is lost at the encoder–decoder link
Attention retrieves information step by step, with disambiguation, and scores it
Encoder–Decoder Attention: which key word is most relevant to the query?
For languages with different grammar structures, attention still looks at the correct token between them
Sampling for next word
Greedy decoding: select the most probable word at each step
Beam search: a broader, more exploratory decoding alternative
Minimum Bayes Risk: compare many samples against each other and select the sample with the highest similarity
Python code
[Figure: decoder query (Q) attending over encoder keys (K); information loss without attention]
Attention = softmax(QKᵀ) V
Q: linearly transformed from the output; K, V: linearly transformed from the input
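A minimal numpy sketch of the formula above, Attention = softmax(QKᵀ)V; the toy shapes and random Q, K, V stand in for the linearly transformed decoder output and encoder input.

```python
# Toy numpy illustration of Attention = softmax(QK^T) V.
# Q stands for the (linearly transformed) decoder output, K and V for the
# (linearly transformed) encoder input; all shapes and values are made up.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))   # 4 decoder positions, dimension 3
K = rng.normal(size=(5, 3))   # 5 encoder positions
V = rng.normal(size=(5, 3))

scores = Q @ K.T              # (4, 5): relevance of each key to each query
weights = softmax(scores)     # each row sums to 1
output = weights @ V          # (4, 3): weighted mix of the values

print(weights.round(2))
print(output.shape)
```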
5
RNN: calculation must happen in sequence
Positional Encoding: add positional info to words
Transformer: parallel computing for all words
Multi-headed Attention
Causal Attention (Self-Attention)
• Queries and Keys are words from the same sentence
• Queries should only be allowed to look at words before them
• Find which words deserve more attention
• Each head uses different linear transformations to represent words
• Different heads can learn different relationships between words
Transformer Decoder
Python code
Online Summarization Tool
transformers GitHub
6
Self-Attention: the meaning of a word can come from other words in the sentence.
Create the query Q, key K, and value V by multiplying the input matrix X with the weight matrices Wq, Wk, and Wv.
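Continuing the same toy setting, a hypothetical self-attention step that builds Q, K, V from one input matrix X with weight matrices Wq, Wk, Wv, and applies the causal mask from the previous slide; the 1/√d scaling is the usual Transformer convention, which the slide's formula omits, and all sizes are made up.

```python
# Self-attention sketch: Q, K, V are projections of the same input X
# (Q = X @ Wq, K = X @ Wk, V = X @ Wv). The causal mask enforces
# "queries may only look at earlier words".
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))                    # one embedded sentence
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                           # queries, keys, values
scores = Q @ K.T / np.sqrt(d_model)

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # positions after i
scores = np.where(mask, -1e9, scores)                         # masked out before softmax

attention_output = softmax(scores) @ V                        # (seq_len, d_model)
print(attention_output.shape)
```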
7
BERT (Bidirectional Encoder Representations from Transformers)
Transfer Learning
Pre-training (base model 110M parameters, large model 340M)
Pre-training the base model with massive data
Fine-tuning models for different applications
Masked Language Modeling (MLM)
Next Sentence Prediction (NSP)
NSP example: "The legislators believed that they were on the right side of history."
Next sentence: "So they changed the law." / Not next: "Then the bunny ate the carrot."
Pre-training data
• Books Corpus (800M words)
• English Wikipedia (2,500M words, ~13G)
Fine-tuning and Data Input
Task | Input A | Input B | Result
Pre-training | Sentence A | Sentence B | MLM, NSP
Classification | Text | None | Sentiment pos/neg? Grammar correct?
Question Answering | Question | Passage | Answer or location in passage
Summary | Article | Summary | Summary of the article
Natural Language Inference | Hypothesis | Premise | Entailment, contradiction, neutral?
Named Entity Recognition | Sentence | Entities | Entities and tags
Paraphrase | Sentence | Paraphrase | Paraphrase of the sentence

Natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise".
Bert GitHub Python
Paper 2019
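A short sketch of the MLM objective using the Hugging Face transformers fill-mask pipeline; this is not the deck's linked BERT code, and the checkpoint name is an assumption.

```python
# MLM demo via the Hugging Face fill-mask pipeline; 'bert-base-uncased' is an
# assumed checkpoint (any BERT model with an MLM head would do).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The legislators believed they were on the right side of [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```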
8
The paper uses the Medical Information Mart
for Intensive Care III (MIMIC-III) dataset.
MIMIC-III consists of the electronic health records
of 58,976 unique hospital admissions from 38,597
patients in the intensive care unit of the Beth
Israel Deaconess Medical Center between 2001
and 2012. There are 2,083,180 de-identified notes
associated with the admissions.
ClinicalBERT accurately predicts 30-day
readmission using discharge summaries.
AUROC: Area under the receiver operating characteristic curve
AUPRC: Area under the precision-recall curve
PR80: Recall at precision of 80%
ClinicalBert paper
BioBert: trained with PubMed abstracts (PubMed) and/or PubMed Central full-text articles (PMC) GitHub
9
T5 (Text-to-Text Transfer Transformer)
Unified Multi-Task Framework: Text as Input, Text as Output
CoLA: Corpus of Linguistic Acceptability
STSB: Semantic Textual Similarity Benchmark
RTE: Recognizing Textual Entailment
MNLI: Multi-Genre Natural Language Inference
MRPC: Microsoft Research Paraphrase Corpus
SQuAD: Stanford Question Answering Dataset
WMT English to German
COPA: Choice of Plausible Alternatives, causal reasoning
MultiRC: Multi-Sentence Reading Comprehension
WiC: Word in Context
WSC: Winograd Schema Challenge, resolve ambiguity
The city councilmen refused the demonstrators a permit
because they [feared/advocated] violence.
Question: “they” refers to?
Transfer Learning with C4 – Colossal Clean Crawled Corpus (~800G); base model with 220M parameters, large model 770M, largest 11B
T5 GitHub Paper 2020 Python
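A small sketch of the text-as-input, text-as-output framing with the Hugging Face transformers T5 classes; the 't5-base' checkpoint and the translation prefix are assumptions, not taken from the deck's linked notebook.

```python
# Text-to-text framing with T5: the task is given as a text prefix and the
# answer comes back as text. 't5-base' (220M parameters) is assumed here.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer("translate English to German: How are the results?",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```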
10
Language Model Meta-Learning
Larger Models Make Increasingly Efficient Use of In-Context Information
paper
Datasets Used to Train GPT-3
Model Size ~ TriviaQA Performance
SAT Analogies (65% vs. average college applicant 57%)
11
paper
12
Gu 2021 models
[Figure: models ordered from less domain vocabulary to more domain vocabulary]
13
Model | Model Full Name | Vocabulary | Training Size
BERT | bert-base-uncased | Wiki + Books | 16G
RoBERTa | roberta-base | Web Crawl | 160G
PubMedBert | microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext | PubMed | 21G
14
Self-Alignment Pretraining for Biomedical Entity Representations SapBert GitHub Model
Liu 2021
Figure 1: The t-SNE visualization of UMLS entities under PUBMEDBERT (BERT pretrained on
PubMed papers) & PUBMEDBERT+ SAPBERT (PUBMEDBERT further pretrained on UMLS
synonyms). The biomedical names of different concepts are hard to separate in the
heterogeneous embedding space (left). After the self-alignment pretraining, the same
concept’s entity names are drawn closer to form compact clusters (right).
Pretraining with UMLS (Unified Medical Language System)
4M+ concepts & 10M+ synonyms (MeSH, SNOMED, RxNorm, Gene Ontology, & OMIM)
Hard Pairs Mining (𝑥𝑎, 𝑥𝑝, 𝑥𝑛)
𝑥𝑎: anchor; 𝑥𝑝: positive synonym match; 𝑥𝑛 : negative synonym match
Only consider triplets with the negative sample closer to the positive sample by a margin of λ.
Loss Function
S: similarity matrix among 𝜒𝑏 items in batch b
Negative pair similarity should be small; positive pair similarity should be large
15
Radford 2021 GitHub
16
Colab
• We demonstrate that the simple pre-training
task of predicting which caption goes with
which image is an efficient and scalable way
to learn SOTA image representations from
scratch on a dataset of 400 million (image,
text) pairs collected from the internet.
• After pre-training, natural language is used to
reference learned visual concepts (or describe
new ones) enabling zero-shot transfer of the
model to downstream tasks.
• We study the performance of this approach by
benchmarking on over 30 different existing
computer vision datasets, spanning tasks such
as OCR, action recognition in videos, geo-
localization, and many types of fine-grained
object classification.
• The model transfers non-trivially to most tasks
and is often competitive with a fully
supervised baseline without the need for any
dataset specific training.
Blog
17
Masked Autoencoders (MAE) Are Scalable Vision Learners He 2021
Figure 1. Our MAE architecture. During pre-training, a large random
subset of image patches (e.g., 75%) is masked out. The encoder is
applied to the small subset of visible patches. Mask tokens are
introduced after the encoder, and the full set of encoded patches and
mask tokens is processed by a small decoder that reconstructs the
original image in pixels. After pre-training, the decoder is discarded,
and the encoder is applied to uncorrupted images to produce
representations for recognition tasks.
Figure 4. Reconstructions of ImageNet validation images using an MAE
pre-trained with a masking ratio of 75% but applied on inputs with
higher masking ratios. The predictions differ plausibly from the original
images, showing that the method can generalize.
18
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Baevski 2022
GitHub
19
 Self-supervised learning turns all human text into potential training data for machines.
 Machines are trained not only on the meaning and semantics of text, but also on reasoning.
 Models with billions of parameters are rapidly gaining more sophisticated capabilities.
20
 Coursera
Natural Language Processing Specialization
Applied Text Mining in Python
 Books
Getting Started with Google BERT
 Papers
Attention Is All You Need
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
Pretrained Transformers for Text Ranking: BERT and Beyond
PASSAGE RE-RANKING WITH BERT
 Blogs
Illustrated: Self-Attention
Natural language inference
Keyword Extraction: from TF-IDF to BERT
Understanding searches better than ever before
 Projects
NLP-progress
Bert Extractive Summarizer
 Colab
A Visual Notebook to Using BERT for the First Time (blog)
21
1. Count word frequency in all training tweets

Word | Counts in Positive Tweets | Counts in Negative Tweets
Happy | 305 | 87
Hard | 66 | 217
NLP | 34 | 29
Learning | 18 | 13

2. Sum the frequencies for each tweet

Tweet | X1 (sum of positive counts) | X2 (sum of negative counts)
Happy learning | 323 (305 + 18) | 100 (87 + 13)
NLP hard | 101 (35 + 66) | 246 (29 + 217)
3. Regression and Sigmoid
z = θ₀ + θ₁X₁ + θ₂X₂
h(z) = 1 / (1 + e^(−z))
Update θ to minimize the difference between h and the label
4. Predict results with the optimized parameters (positive / negative)
Python code
Issue: information from individual words is partially lost in the summation
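A minimal scikit-learn sketch of steps 1–4 above: each tweet becomes the two summed-count features (X1, X2) and logistic regression fits θ. The four example rows and their labels are illustrative; the first two use the feature values from the table above.

```python
# Steps 1-4 with scikit-learn: each tweet is reduced to two features,
# X1 = summed positive-class counts and X2 = summed negative-class counts,
# and logistic regression fits theta so the sigmoid h(z) separates the classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

# [X1, X2]; labels 1 = positive, 0 = negative (toy data)
X = np.array([[323, 100],    # "Happy learning"
              [101, 246],    # "NLP hard"
              [305,  87],    # "Happy"
              [ 66, 217]])   # "Hard"
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[323, 100], [101, 246]]))    # expected [1 0]
print(clf.predict_proba([[323, 100]])[0, 1])    # h(z): probability of the positive class
```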
22
1. Bayes' Rule
P(Positive ∩ happy) = P(happy) × P(Positive | happy) = P(Positive) × P(happy | Positive)
P(Positive | happy) = P(happy | Positive) × P(Positive) / P(happy)
2. P(happy | Positive) = freq(w_i, Class) / N_Class = 2 / 13
3. Laplacian Smoothing to handle zero values
P(w_i | Class) = (freq(w_i, Class) + 1) / (N_Class + V)
N_Class: total frequency of all words in a class; V: number of unique words in the vocabulary
4. Log Likelihood
P(happy | Positive) = (2 + 1) / (13 + 8) = 0.14
Doc: "I am happy learning NLP"
log likelihood = log(0.50/0.50) + log(0.19/0.19) + log(0.19/0.19) + log(0.14/0.10) + log(0.10/0.10) + log(0.10/0.10) = 0 + 0 + 0 + 0.146 + 0 + 0 = 0.146 > 0 → positive
Python code
P(Positive | happy) / P(Negative | happy) = [P(Positive) / P(Negative)] × [P(happy | Positive) / P(happy | Negative)], where the first factor is the prior ratio
Word | Pos counts | Neg counts | p(w|pos) | p(w|neg)
I | 3 | 3 | 0.19 | 0.19
am | 3 | 3 | 0.19 | 0.19
happy | 2 | 1 | 0.14 | 0.10
because | 1 | 0 | 0.10 | 0.05
learning | 1 | 1 | 0.10 | 0.10
NLP | 1 | 1 | 0.10 | 0.10
sad | 1 | 2 | 0.10 | 0.14
not | 1 | 2 | 0.10 | 0.14
N_class | 13 | 13 | |
Issue: word distributions are calculated without context
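A small sketch of steps 2–4 with the counts from the table above, using Laplacian smoothing and a base-10 log-likelihood as on the slide; with equal priors the prior-ratio term contributes log(1) = 0.

```python
# Naive Bayes with Laplacian smoothing: P(w|class) = (freq(w, class) + 1) / (N_class + V).
# A document is scored by summing log10(P(w|pos) / P(w|neg)) over its words.
import math

pos = {"I": 3, "am": 3, "happy": 2, "because": 1, "learning": 1, "NLP": 1, "sad": 1, "not": 1}
neg = {"I": 3, "am": 3, "happy": 1, "because": 0, "learning": 1, "NLP": 1, "sad": 2, "not": 2}
V = len(pos)                                           # 8 unique words in the vocabulary
N_pos, N_neg = sum(pos.values()), sum(neg.values())    # 13 and 13

def p(word, counts, n_class):
    return (counts.get(word, 0) + 1) / (n_class + V)   # Laplacian smoothing

doc = "I am happy learning NLP".split()
score = sum(math.log10(p(w, pos, N_pos) / p(w, neg, N_neg)) for w in doc)
# ~0.18 with unrounded probabilities (the slide gets 0.146 after rounding); > 0 -> positive
print(round(score, 3), "positive" if score > 0 else "negative")
```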
23
N-Gram and Probability [Corpus: I am happy because I am learning]
• Unigram: {I, am, happy, because, learning}   P(I) = C(I) / C(all) = 2/7
• Bigram: {I am, am happy, happy because, …}   P(happy | am) = C(am happy) / C(am) = 1/2
• Trigram: {I am happy, am happy because, …}   P(happy | I am) = C(I am happy) / C(I am) = 1/2
Approximation of Sequence Probability
• Use N-grams for approximation since long sequences are rare
  Bigram: P(the teacher drinks tea) ≈ P(the) P(teacher | the) P(drinks | teacher) P(tea | drinks)
• Interpolation to handle missing terms
  Trigram: P(tea | teacher drinks) ≈ 0.7 × P(tea | teacher drinks) + 0.2 × P(tea | drinks) + 0.1 × P(tea)
• Add start and end tokens to the sentence: <s> the teacher drinks tea </s>
  P(the teacher drinks tea) ≈ P(the | <s>) P(teacher | the) P(drinks | teacher) P(tea | drinks) P(</s> | tea)
Probability Matrix [Corpus: I study I learn]
Applications
Auto Complete
Generative Text
Python code
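A short sketch of bigram estimation and the sequence-probability approximation on the slide's toy corpus; the scored sequence is illustrative.

```python
# Bigram model on the toy corpus, with <s> / </s> sentence tokens:
# P(w2 | w1) = C(w1 w2) / C(w1); a sequence probability is the product of its bigrams.
from collections import Counter

corpus = [["<s>", "I", "am", "happy", "because", "I", "am", "learning", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))

def p_bigram(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("happy", "am"))            # C(am happy) / C(am) = 1/2

seq = ["<s>", "I", "am", "happy"]         # bigram approximation of a short sequence
prob = 1.0
for w1, w2 in zip(seq, seq[1:]):
    prob *= p_bigram(w2, w1)
print(prob)                               # 1 * 1 * 0.5 = 0.5
```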
24
TF-IDF (Term Frequency – Inverse Document Frequency)
tf = frequency of a term in a document
idf = log(N_all / N_t)
tf-idf = tf × idf = tf × log(N_all / N_t)
N_all: total number of articles; N_t: articles containing term t
Wikipedia TF-IDF Dataset Release

Term | N_t | N_all | idf
the | 5,457,533 | 5,989,879 | 0.09
disease | 67,085 | 5,989,879 | 4.49
encephalitis | 904 | 5,989,879 | 8.80
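Recomputing the idf column of the table above from the formula idf = log(N_all / N_t); with the natural log the values match the slide.

```python
# idf = log(N_all / N_t); natural log reproduces the table values (0.09, 4.49, 8.80).
import math

N_all = 5_989_879                                   # total Wikipedia articles
terms = {"the": 5_457_533, "disease": 67_085, "encephalitis": 904}

for term, n_t in terms.items():
    print(f"{term:>14}  idf = {math.log(N_all / n_t):.2f}")
```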
TextRank (based on a graph of co-occurring words)
Important words are surrounded by other important words.
Word distance: 2 ~ 10
Similar to PageRank
Python Lib: Summa
YAKE (Yet Another Keyword Extractor)
Paper published in 2020
Jellyfish package is used to calculate word distance
KeyBERT (Keyword Extraction with BERT)
SentenceTransformer: word embedding for article and keywords
Supported Pretrained Models:
• stsb-roberta-large 1.31G
• nli-roberta-large 1.31G
• distilbert-base-nli-mean-tokens 244M
Dimension | 0 | 1 | 2 | 3 | 4 | 5 | 6 | … | 1023
Article | 1.35 | 0.98 | -0.34 | 0.94 | -0.17 | 1.38 | -0.07 | … | 1.09
Key 1 | 0.04 | -0.22 | -0.87 | 0.92 | 0.82 | 1.15 | 0.14 | … | 1.71
Then compute the similarity between the article and the keywords
Sentence meaning can be pooled from:
• [CLS]
• Mean of all words
• Max of all words
25
 File Operation
f = open(filename, mode) f.close()
f.readline() f.read(n) f.write(message)
for line in f: do_something(line)
df = pd.read_csv(filename) df.to_csv(filename)
 Extract Text from HTML File
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(html_str, 'html.parser')
html_text = html_soup.get_text()
 Contraction Expansion
“can’t” → “cannot”; “We’re” → “We are”
Regular expression pattern substitution
 Word Comparison
s.startswith(t) s.endswith(t)
t in s
s.isupper() s.islower() s.istitle()
s.isalpha() s.isdigit() s.isalnum()
 String Operations
s.lower() s.upper() s.title()
s.split(t) s.splitlines() s.join(t)
s.strip() s.rstrip()
s.find(t) s.rfind(t) s.replace(u,v)
 Regular Expression
import re
Remove punctuation: re.sub(r'[^\w\s]', '', s)   \w: word characters, \s: whitespace
Find call-outs: re.search(r'@[A-Za-z0-9_]+', s) or re.search(r'@[\w]+', s)
 Remove Stop Words [NLTK: Natural Language Toolkit]
from nltk.corpus import stopwords
nltk.download()
stop = stopwords.words('english')
" ".join(x for x in s.split() if x not in stop)
 Tokenization
nltk.word_tokenize(text)
nltk.sent_tokenize(text)
 Stemming
“fish”, “fishing”, “fishes” → “fish”, “leaves” → “leav”
porter = nltk.PorterStemmer()
porter.stem('fishing')
 Lemmatization
“good”, “better”, “best” → “good”, “leaves” → “leaf”
lemma = nltk.WordNetLemmatizer()
lemma.lemmatize('leaves')
 Part of Speech (POS) Tagging
nltk.pos_tag()
26
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import torch
import transformers as ppb
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv',
                 delimiter='\t', header=None)
model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
max_len = max(len(x) for x in tokenized.values)
padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
attention_mask = np.where(padded != 0, 1, 0)
input_ids = torch.tensor(padded).to(torch.int64)
attention_mask = torch.tensor(attention_mask).to(torch.int64)
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)  # run BERT once, no gradients
features = last_hidden_states[0][:, 0, :].numpy()  # [CLS] embedding of each sentence
labels = df[1]
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)
lr_clf.score(test_features, test_labels)
[Figures: previews of df and tokenized; the sentence embedding is taken from the [CLS] token]
Logistic regression is applied to the 768 embedding values of each sentence to decide its sentiment classification.
Result: 0.86
Colab
27
Context-based Embedding
Sentence A: He got bit by Python.
Sentence B: Python is my favorite programming language.
BERT Configurations
L (# of encoders) A (attention heads) H (hidden units)
Bert-base 12 12 768
Bert-large 24 16 1024
BERT uses the WordPiece tokenizer
"Let us start pretraining the model."
tokens = [let, us, start, pre, ##train, ##ing, the, model]
Masked Language Model
The feedforward network takes the representation of the masked token as input and returns the probability of every word in the vocabulary being the masked word.
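A quick check of the WordPiece behaviour described above with the transformers tokenizer; the 'bert-base-uncased' checkpoint is an assumption.

```python
# WordPiece tokenization: rare or unseen words are split into sub-word pieces
# prefixed with '##', as in the slide's example.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Let us start pretraining the model."))
# roughly: ['let', 'us', 'start', 'pre', '##train', '##ing', 'the', 'model', '.']
```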
28
Sentiment Analysis, Natural Language Inference, Named Entity Recognition
Hugging Face transformers documentation
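A minimal sketch of two of the tasks above using the Hugging Face pipeline API with its default checkpoints (not pinned here); see the documentation linked above for the full set of tasks.

```python
# Off-the-shelf pipelines from the transformers library; default checkpoints
# are downloaded automatically.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("I love learning about transformer models!"))

ner = pipeline("ner")
print(ner("Sharon Floyd flew to Miami on Friday"))
```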
29
Paragraph = "The immune system is a system of many
biological structures and processes within an organism
that protects against disease. To function properly, an
immune system must detect a wide variety of agents,
known as pathogens, from viruses to parasitic worms, and
distinguish them from the organism's own healthy tissue."
Question = "What is the immune system?"
Answer = "a system of many biological structures and
processes within an organism that protects against disease"
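A hedged sketch of the same question-answering example with the transformers pipeline; the default QA checkpoint is assumed, and the context is abbreviated from the paragraph above.

```python
# Extractive question answering: the answer is a span of the given context,
# as in the example above.
from transformers import pipeline

context = ("The immune system is a system of many biological structures and "
           "processes within an organism that protects against disease.")
qa = pipeline("question-answering")
print(qa(question="What is the immune system?", context=context))
```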
30
Extractive summarization
• Pick important sentences from a text.
• Add [CLS] to represent each sentence and judge whether the sentence should be included.
Abstractive summarization
• Paraphrase the given text while keeping its essential meaning.
Fine-tune BERT for Extractive Summarization Text Summarization with Pretrained Encoders
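As a rough sketch of extractive summarization, the bert-extractive-summarizer package (listed under Projects on the resources slide) picks the most important sentences; its API details here are assumptions from the package's documentation, and the text reuses the paragraph from the previous slide.

```python
# Extractive summarization sketch with the bert-extractive-summarizer package.
from summarizer import Summarizer

text = ("The immune system is a system of many biological structures and processes "
        "within an organism that protects against disease. To function properly, an "
        "immune system must detect a wide variety of agents, known as pathogens, and "
        "distinguish them from the organism's own healthy tissue.")

model = Summarizer()                      # BERT encodes each sentence via [CLS]
print(model(text, num_sentences=1))       # keep the most important sentence
```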
31
monoBERT
The final representation of the [CLS] token is fed to a fully-connected layer that produces the relevance score s of that text with respect to the query.
Birch