BERT: Bidirectional
Encoder Representations
from Transformers
By: Shaurya Uppal
Defining Language
Language is commonly divided into three parts:
● Syntax
● Semantics
● Pragmatics
Syntax: word ordering and sentence form.
Semantics: the meaning of words.
Pragmatics: the social language skills we use in our daily interactions with others.
Example of Syntax, Semantics, Pragmatics
+ This discussion is about BERT. (syntax, semantics, and pragmatics all hold: SSP)
+ The green frogs sleep soundly. (syntax and semantics hold, pragmatics does not: SS)
+ BERT play football good (none hold)
Why study BERT?
BERT achieves state-of-the-art performance on many Natural Language
Processing tasks. It can perform tasks such as text classification, text similarity,
next-sentence prediction, question answering, auto-summarization,
named entity recognition, etc.
What is BERT, exactly?
BERT is a model pretrained by Google that learns
bidirectional representations from unlabeled text by
jointly conditioning on both left and right context in
all layers.
Datasets used to pre-train BERT
+ BooksCorpus (800M words)
+ English Wikipedia (2,500M+ words)
A pretrained model can be applied with either a feature-based approach or fine-tuning.
+ In fine-tuning, all weights change.
+ In the feature-based approach, only the final layer weights change (the approach
taken by ELMo).
The pretrained model is then fine-tuned on different NLP tasks.
Pretraining and fine-tuning: you train a model m on data A, then continue training m
on data B from that checkpoint (see slide 17).
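As a rough illustration (not part of the original slides), the difference between the two approaches can be sketched with the Hugging Face transformers library and PyTorch: fine-tuning keeps every parameter trainable, while a feature-based setup freezes the pretrained encoder and trains only the task head.

# Minimal sketch, assuming the Hugging Face transformers library and PyTorch.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Fine-tuning: all weights (encoder + task head) remain trainable.
finetune_params = [p for p in model.parameters() if p.requires_grad]

# Feature-based: freeze the pretrained encoder, train only the classifier head.
for param in model.bert.parameters():
    param.requires_grad = False
feature_based_params = [p for p in model.parameters() if p.requires_grad]

print(len(finetune_params), "trainable tensors when fine-tuning")
print(len(feature_based_params), "trainable tensors in the feature-based setup")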
Language Model
Training
Approaches
There are two approaches to training a language model:
Context-free
+ Traditionally, static embeddings such as word2vec
or GloVe.
Contextual
+ RNN-based models
+ BERT
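To make the context-free vs. contextual distinction concrete, here is a minimal sketch (assuming the Hugging Face transformers library and PyTorch): a static embedding table such as word2vec would give a single vector for "bank", whereas BERT produces a different vector depending on the sentence.

# Sketch: contextual embeddings of the same word in two sentences.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bert_vector(sentence, word):
    """Contextual embedding of `word` taken from BERT's last hidden layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = bert_vector("i deposited cash at the bank", "bank")
v2 = bert_vector("we sat on the bank of the river", "bank")
# A context-free model (word2vec/GloVe) would return one vector for "bank";
# BERT returns two noticeably different vectors here.
print(torch.cosine_similarity(v1, v2, dim=0).item())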
How does BERT work?
BERT's weights are learned in advance through two unsupervised tasks: masked
language modeling (predicting a missing word given the left and right context)
and next sentence prediction (predicting whether one sentence follows another).
BERT makes use of the Transformer, an attention mechanism that learns contextual
relations between words (or sub-words) in a text
(see the paper "Attention Is All You Need").
BERT uses multi-headed attention: multiple layers of attention, each with
multiple attention "heads" (12 or 16, depending on the model size). Since model
weights are not shared between layers, a BERT-base model effectively has
12 x 12 = 144 distinct attention mechanisms.
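Those attention mechanisms can be inspected directly; a minimal sketch, assuming the Hugging Face transformers library:

# Sketch: extracting BERT-base's per-layer, per-head attention matrices.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The animal didn't cross the street.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # tuple: one tensor per layer

num_layers = len(attentions)                  # 12 for BERT-base
num_heads = attentions[0].shape[1]            # 12 heads per layer
print(num_layers * num_heads, "attention mechanisms")   # 144
# attentions[layer][0, head] is a (seq_len x seq_len) matrix of attention weights.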
What does BERT learn, and how does it tokenize and handle
OOV words?
Consider the input example: I went to the store. At the store, I bought fresh
strawberries.
BERT uses a WordPiece tokenizer, which breaks an OOV (out-of-vocabulary) word
into sub-word segments. For example, if play, ##ing, and ##ed are present in the
vocabulary but playing and played are OOV words, they are broken down
into play + ##ing and play + ##ed respectively (## marks a sub-word
continuation).
BERT also requires a special [CLS] classifier token at the beginning of the
sequence and a [SEP] token at the end of each segment:
[CLS] I went to the store. [SEP] At the store I bought fresh straw ##berries. [SEP]
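A quick way to see this behaviour is the sketch below, assuming the Hugging Face BertTokenizer (the exact sub-word splits depend on the released vocabulary, so the printed pieces are illustrative):

# Sketch: WordPiece tokenization and the [CLS]/[SEP] special tokens.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Words missing from the vocabulary are split into known sub-word pieces.
print(tokenizer.tokenize("strawberries"))        # e.g. ['straw', '##berries']

# Calling the tokenizer on a sentence pair adds the special tokens automatically.
encoded = tokenizer(
    "I went to the store.", "At the store, I bought fresh strawberries."
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# -> ['[CLS]', 'i', 'went', ..., '[SEP]', 'at', 'the', ..., 'straw', '##berries', '.', '[SEP]']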
Attention
An attention probe is a task for a pair of tokens (token_i, token_j) where the input is a model-wide attention vector formed by
concatenating the entries a_ij from every attention matrix, from every attention head, in every layer.
Some visual attention patterns, and why we use the
attention mechanism
Reason for attention: attention supports the two main
pretraining tasks of BERT, MLM (Masked Language Model)
and NSP (Next Sentence Prediction).
Visual patterns from the attention mechanism
● Attention to the next word [Layer 2, Head 0] | behaves like a backward RNN
● Attention to the previous word [Layer 0, Head 2] | behaves like a forward RNN
● Attention to identical/related words
● Attention to identical words in the other sentence | helps the next sentence prediction task
● Attention to other words predictive of the word
● Attention to the delimiter tokens [CLS] and [SEP]
● Attention to a bag of words
How is input fed into BERT?
MLM: Masked Language Model
Input: My dog is hairy.
Masking is done randomly: 15% of all WordPiece tokens in each input
sequence are masked. We only predict the masked tokens rather than
the entire input sequence.
Procedure:
+ 80% of the time: replace the word with [MASK]. My dog is [MASK].
+ 10% of the time: replace the word with a random word. My dog is apple.
+ 10% of the time: keep the word unchanged. My dog is hairy.
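The 80/10/10 rule can be written as a small masking routine; the sketch below is illustrative only and simplifies special-token handling and vocabulary sampling:

# Sketch of the MLM masking procedure (80% [MASK], 10% random, 10% unchanged).
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
    """Return (masked_tokens, labels); labels is None for positions not predicted."""
    masked, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or random.random() > mlm_prob:
            masked.append(tok)
            labels.append(None)                  # not selected for prediction
            continue
        labels.append(tok)                       # the model must predict the original token
        r = random.random()
        if r < 0.8:
            masked.append(mask_token)            # 80%: replace with [MASK]
        elif r < 0.9:
            masked.append(random.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(tok)                   # 10%: keep the original token
    return masked, labels

tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
print(mask_tokens(tokens, vocab=["apple", "cat", "blue"]))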
Why does MLM work well? The model does not know which words it will be asked to
predict or which have been replaced by random words, so it is forced to keep a
contextual representation of every token.
NSP: Next Sentence Prediction
Training method:
From unlabelled data, we take an input
sequence A; 50% of the time the
sequence that actually follows A is
used as B, and the remaining 50% of
the time a random sequence is picked
as B.
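A sketch of how such 50/50 sentence pairs could be built from an unlabelled corpus (illustrative; a real pipeline would sample the random sentence from a different document):

# Sketch: building NSP training pairs (label 1 = B really follows A).
import random

def make_nsp_pairs(sentences):
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if random.random() < 0.5:
            b, label = sentences[i + 1], 1            # true next sentence
        else:
            b, label = random.choice(sentences), 0    # random sentence (ideally from another document)
        pairs.append((a, b, label))
    return pairs

corpus = [
    "I went to the store.",
    "At the store, I bought fresh strawberries.",
    "Penguins live in the southern hemisphere.",
]
for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)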
BERT Architecture
BERT is a multi-layer bidirectional Transformer encoder.
Two models are introduced in the paper:
● BERT-base – 12 layers (Transformer blocks), 12
attention heads, and 110 million parameters.
● BERT-large – 24 layers, 16 attention heads, and 340
million parameters.
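These sizes can be verified by loading the released checkpoints and counting parameters; a minimal sketch, assuming the Hugging Face transformers library:

# Sketch: checking BERT-base vs BERT-large sizes.
from transformers import BertModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    cfg = model.config
    print(f"{name}: {cfg.num_hidden_layers} layers, "
          f"{cfg.num_attention_heads} heads, ~{n_params / 1e6:.0f}M parameters")
# Roughly: base -> 12 layers / 12 heads / ~110M, large -> 24 layers / 16 heads / ~340M.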
BERT Pretraining and Fine-Tuning Architecture
Illustration of how the BERT pretraining
architecture remains the same and
only the fine-tuning layer architecture
changes for different NLP tasks.
Related Work
ELMo: a pretrained, feature-based model (only the final layer weights
change) for NLP tasks. Difference: ELMo uses LSTMs; BERT uses the Transformer, an attention-based
model with positional encodings to represent word positions. ELMo also fell short because it was
word-based and could not handle OOV words.
OpenAI GPT: uses a left-to-right architecture where every token can only attend to
previous tokens in the Transformer's self-attention layers. It fell short because it could
not capture proper bidirectional context.
How does BERT outperform the others?
The paper Visualizing and Measuring the Geometry of BERT shows how
BERT captures the semantic and syntactic features of a text.
The paper aims to show that the attention matrices contain grammatical
representations. Turning to semantics, using visualizations of the activations
created by different pieces of text, it presents suggestive evidence that BERT
distinguishes word senses at a very fine level.
BERT's internals consist of two parts. First, an initial embedding for each token is created by combining a
pre-trained word-piece embedding with position and segment information. Next, this initial sequence of
embeddings is run through multiple Transformer layers, producing a new sequence of context embeddings
at each step. Implicit in each Transformer layer is a set of attention matrices, one per attention head,
each of which contains a scalar value a_ij for every ordered pair (token_i, token_j) (see slide 11).
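Both parts are visible when the model is asked to return its intermediate states; a minimal sketch, assuming the Hugging Face transformers library:

# Sketch: initial embeddings plus one context-embedding sequence per layer.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("He sat by the bank of the river.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# hidden_states[0] is the initial (word-piece + position + segment) embedding;
# hidden_states[1..12] are the context embeddings after each Transformer layer.
print(len(out.hidden_states), "embedding sequences")   # 13 for BERT-base
print(out.hidden_states[-1].shape)                     # (1, seq_len, 768)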
Experiment for Syntax Representation
The experiment uses the Penn Treebank corpus (3.1M dependency relations).
Grammatical dependencies were extracted with the PyStanfordDependencies library;
BERT-base was then run on each sentence to obtain the model-wide attention
vectors (see slide 9).
With a 30% train/test split, the attention probes reach an accuracy of 85.8% on the
binary probe and 71.9% on the multiclass probe.
This demonstrates that the attention mechanism contains syntactic features.
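A heavily simplified version of such an attention probe is sketched below, assuming the transformers and scikit-learn libraries; the Penn Treebank / PyStanfordDependencies data preparation is not reproduced, so the example sentences and dependency labels are placeholders:

# Sketch: a binary attention probe for token pairs.
# For each (token_i, token_j), the feature vector concatenates the attention entries
# a_ij from every head in every layer (12 x 12 = 144 values for BERT-base).
import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

def pair_features(sentence, i, j):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        attn = model(**inputs).attentions            # tuple of (1, heads, seq, seq)
    return torch.cat([layer[0, :, i, j] for layer in attn]).numpy()   # 144-dim vector

# Placeholder data: (sentence, i, j, has_dependency); real labels would come from a
# dependency-parsed corpus. Token indices include [CLS] at position 0.
examples = [("the dog barked loudly", 2, 1, 1),    # dog -> the (determiner relation)
            ("the dog barked loudly", 3, 1, 0)]    # barked -> the (no relation)
X = [pair_features(s, i, j) for s, i, j, _ in examples]
y = [label for *_, label in examples]
probe = LogisticRegression(max_iter=1000).fit(X, y)   # the binary probe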
Geometry of Word Sense (Experiment)
On Wikipedia articles containing a query word, a nearest-neighbour classifier was
applied, where each neighbour is the centroid of a given word sense's BERT-base
embeddings in the training data.
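A sketch of that nearest-centroid classifier, assuming the transformers library and NumPy; the sense labels and example sentences below are hypothetical stand-ins for the annotated training data:

# Sketch: nearest-centroid word-sense classification with BERT embeddings.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence, word):
    """BERT-base embedding of `word` from the last hidden layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)].numpy()

# Training data: sense label -> example sentences containing the query word "bank".
train = {
    "river":   ["we walked along the bank of the river"],
    "finance": ["she opened an account at the bank"],
}
centroids = {sense: np.mean([embed(s, "bank") for s in sents], axis=0)
             for sense, sents in train.items()}

query = embed("fish swam near the muddy bank", "bank")
prediction = min(centroids, key=lambda s: np.linalg.norm(query - centroids[s]))
print(prediction)   # expected: "river"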
Conclusion
BERT is undoubtedly a breakthrough in the use of Machine Learning for Natural Language
Processing. The fact that it is approachable and allows fast fine-tuning will likely enable a wide
range of practical applications in the future.
We tested BERT on our SupportLen data for text classification.
SupportLen has a priority column in which we manually label
whether a customer email is urgent or not.
On this dataset we used BERT-base-uncased.
Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 28996
}
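For reference, a configuration like the one above maps onto the transformers API roughly as follows (a sketch; in practice the config shipped with the chosen pretrained checkpoint is loaded together with its weights):

# Sketch: turning the config above into a two-label classifier for email urgency.
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer

config = BertConfig(
    attention_probs_dropout_prob=0.1,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    hidden_size=768,
    initializer_range=0.02,
    intermediate_size=3072,
    max_position_embeddings=512,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=2,
    vocab_size=28996,
    num_labels=2,          # urgent vs. not urgent
)

# In practice the pretrained weights (with their matching config) are loaded instead:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)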
Some FAQs on BERT
1. WHAT IS THE MAXIMUM SEQUENCE LENGTH OF THE INPUT?
512 tokens
2. OPTIMAL VALUES OF THE HYPERPARAMETERS USED IN FINE-TUNING
● Dropout – 0.1
● Batch Size – 16, 32
● Learning Rate (Adam) – 5e-5, 3e-5, 2e-5
● Number of epochs – 3, 4
3. HOW MANY LAYERS ARE FROZEN IN THE FINE-TUNING STEP?
No layers are frozen during fine-tuning. All the pre-trained layers along with the task-specific parameters are trained
simultaneously.
4. HOW LONG DID IT TAKE GOOGLE TO PRETRAIN BERT?
Google took 4 days to pretrain BERT with 16 TPUs.
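Putting the recommended values together, a minimal fine-tuning loop might look like the sketch below (assuming the transformers library and PyTorch; the example texts, labels, and batching are hypothetical):

# Sketch: fine-tuning BERT for binary text classification with the FAQ's hyperparameters.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["my order never arrived, please help!", "thanks for the quick reply"]   # hypothetical
labels = [1, 0]                                                                  # 1 = urgent
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
dataset = list(zip(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)        # batch size 16 or 32

optimizer = AdamW(model.parameters(), lr=2e-5)                   # 5e-5 / 3e-5 / 2e-5
model.train()                                                    # no layers are frozen
for epoch in range(3):                                           # 3-4 epochs
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()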
ULMFiT: Universal Language Model Fine-tuning for Text Classification
The ULMFiT paper added an intermediate step in which the model is fine-tuned on
text from the same domain as the target task. Doing this before the classification step on top of the
pretrained BERT model gives better accuracy than using the BERT model alone. We also fine-tune BERT on our
custom data; it took around 50 minutes per epoch on a Tesla K80 12GB GPU on a P2 instance.
Future work and use cases that BERT can solve for us
+ Email prioritization
+ Sentiment analysis of reviews
+ Review tagging
+ Question answering for chatbot & community
+ The similar-products problem, for which we currently use cosine similarity on description
text.
Testing of the ULMFiT-style experiment remains to be done, by fine-tuning BERT on our domain
dataset.


Editor's Notes

  1. Language discussion
  2. Examples
  3. BERT's power; what is BERT
  4. Data used and pretrain vs finetune
  5. Talk about: "bank of the river" vs. "bank account" (the meaning of "bank" depends on context)
  6. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
  7. Word Piece Tokenizer: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
  8. Attention Visual:- https://colab.research.google.com/drive/1Nlhh2vwlQdKleNMqpmLDBsAwrv_7NnrB
  9. Understanding the Attention Patterns: https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
  10. A positional embedding is also added to each token to indicate its position in the sequence.
  11. The advantage of this method is that the Transformer does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every token.
  12. NSP helps in Q&A and understand the relation b/w sentences.
  13. State of the Art: the most recent stage in the development of a product, incorporating the newest ideas and features. Parse Tree Embedding Concept- mathematical proof
  14. Miscellaneous:- Matthew Correlation Coefficient: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
  15. Miscellaneous: What is a TPU? https://www.google.com/search?q=tpu+full+form&rlz=1C5CHFA_enIN835IN835&oq=TPU+full+form&aqs=chrome.0.0l6.3501j0j9&sourceid=chrome&ie=UTF-8 Which Activation is used in BERT? https://datascience.stackexchange.com/questions/49522/what-is-gelu-activation Gaussian Error Linear Unit
  16. Demo of Google Collab: Sentiment on Movie Reviews: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb?authuser=1#scrollTo=VG3lQz_j2BtD Sentence Pairing and Sentence Classification: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb?authuser=1#scrollTo=0yamCRHcV-nQ BERT FineTune on Data. https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning