Transformers AI
By Rahul Kumar
Recurrent Neural Networks (RNN)
• An RNN can be thought of as a feed-forward network rolled out over time: multiple copies of the same network, each passing a message (its hidden state) to its successor.
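A minimal NumPy sketch (not from the slides; the weight names W_xh, W_hh and the tanh nonlinearity are illustrative) of this "rolled out" view: the same cell is applied at every time step and passes its hidden state to the next step.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Apply the same RNN cell at every time step.

    x_seq: (T, input_dim) sequence of input vectors.
    Returns the hidden states h_1..h_T.
    """
    h = np.zeros(W_hh.shape[0])                    # initial hidden state (the "message")
    hidden_states = []
    for x_t in x_seq:                              # the network is rolled out over time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # same weights reused at every step
        hidden_states.append(h)
    return hidden_states

# Toy usage: 5 time steps, 3-dim inputs, 4-dim hidden state
rng = np.random.default_rng(0)
h_list = rnn_forward(rng.normal(size=(5, 3)),      # input sequence
                     rng.normal(size=(4, 3)),      # W_xh
                     rng.normal(size=(4, 4)),      # W_hh
                     np.zeros(4))                  # b_h
print(len(h_list), h_list[-1].shape)               # 5 (4,)
```

Because each step needs the previous step's hidden state, the loop cannot be parallelized, which is exactly the slowness the next slide points out.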
An RNN has two major disadvantages:
• Slow to train
• Vanishing gradient
Long Short-Term Memory (LSTM)
• Long short-term memory (LSTM) is a special kind of RNN, designed to mitigate the vanishing gradient problem. LSTMs are capable of learning long-term dependencies.
• LSTMs can selectively remember what is important and forget what is not.
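A hedged NumPy sketch (illustrative parameter names, not from the slides) of a single LSTM step, showing the forget, input, and output gates that let the cell selectively remember or discard information:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the
    forget (f), input (i), output (o) gates and the candidate cell (g)."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # what to forget
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # what to write
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # what to expose
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate content
    c = f * c_prev + i * g     # cell state: selectively remember / forget
    h = o * np.tanh(c)         # hidden state passed to the next step
    return h, c

# Toy usage: 3-dim input, 4-dim hidden/cell state
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 3)) for k in "fiog"}
U = {k: rng.normal(size=(4, 4)) for k in "fiog"}
b = {k: np.zeros(4) for k in "fiog"}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), W, U, b)
print(h.shape, c.shape)        # (4,) (4,)
```

The extra gate computations per step are also why, as noted below, an LSTM is even slower to train than a plain RNN.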
Drawbacks of LSTM
• LSTM is even slower to train than a plain RNN.
• The vanishing gradient problem still persists for very long sequences.
• Sequential computation inhibits parallelization.
• Distance between positions is linear.
Simple RNN vs LSTM
Encoder-Decoder machine translation
Encoder-Decoder LSTM structure for chatting
Attention
To address some of these problems, such as the vanishing gradient, researchers developed a technique for paying attention to specific words.
Attention in neural networks is somewhat similar to attention in humans: the model focuses on certain parts of the input while the rest gets less emphasis. Attention greatly improved the quality of machine translation, as it allows the model to focus on the relevant parts of the input sequence as needed.
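A minimal NumPy sketch of the idea (illustrative; it uses simple dot-product scoring rather than any specific published attention variant): score each encoder state against the current decoder state, normalize the scores with a softmax, and take a weighted sum so the relevant input positions get the emphasis.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Weight each encoder state by its relevance to the current
    decoder state and return the weighted sum (the context vector)."""
    scores = encoder_states @ decoder_state   # one score per input position
    weights = softmax(scores)                 # high weight = more emphasis
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

# Toy usage: 6 input positions, 4-dim states
rng = np.random.default_rng(1)
ctx, w = attend(rng.normal(size=4), rng.normal(size=(6, 4)))
print(w.round(2), ctx.shape)                  # weights sum to 1; ctx is (4,)
```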
But some of the problems we discussed are still not solved by RNNs with attention. For example, processing the inputs (words) in parallel is not possible, so for a large corpus of text the time spent translating grows accordingly.
Convolutional Neural Networks (CNN)
1. Convolutional Neural Networks help solve these problems (see the sketch below):
• Trivial to parallelize (per layer)
• Exploit local dependencies
• Distance between positions is logarithmic (with stacked or dilated filters)
2. Why Transformers?
The problem is that Convolutional Neural Networks do not, on their own, solve the problem of modelling dependencies when translating sentences. That is why Transformers were created: they combine the parallel processing of CNNs with attention.
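A small NumPy sketch (illustrative) of a one-dimensional convolution over a word sequence: each output position depends only on a local window of k inputs and is computed independently of the others, which is what makes the layer trivial to parallelize; reaching distant positions in few layers additionally requires stacking or dilating the filters.

```python
import numpy as np

def conv1d_seq(X, W, b):
    """One convolutional layer over a sequence.

    X: (seq_len, d_in) word vectors; W: (k, d_in, d_out) filter of width k.
    Each output position sees only a local window of k inputs, and the
    positions do not depend on one another, so they can run in parallel.
    """
    k, d_in, d_out = W.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))               # same-length output
    return np.stack([
        sum(Xp[t + j] @ W[j] for j in range(k)) + b    # local window only
        for t in range(X.shape[0])
    ])

# Toy usage: 7 words, 8-dim inputs, width-3 filters, 16 output channels
rng = np.random.default_rng(0)
Y = conv1d_seq(rng.normal(size=(7, 8)), rng.normal(size=(3, 8, 16)), np.zeros(16))
print(Y.shape)   # (7, 16): one output vector per position, computed independently
```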
Image caption generation using attention
[Figure] A CNN produces a feature vector for each image region. An initial query vector z0 (also learned) is matched against every region vector (e.g. a match score of 0.7 for one region). The normalised match scores (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0) weight the region vectors, and their weighted sum is used to generate Word 1 and the next query z1. Attending to the regions again with z1 (e.g. weights 0.0, 0.8, 0.2, 0.0, 0.0, 0.0) yields Word 2 and z2, and so on.
Transformers ("Attention Is All You Need", 2017)
1. Solve the problem of parallelization
2. Overcome the vanishing gradient issue
3. Use self-attention
Based on: Convolutional Neural Networks (CNN) + Attention
Using: a multi-headed attention layer
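A minimal single-head NumPy sketch (illustrative sizes; the real model uses multiple heads plus output projections, residual connections, and layer normalization) of the scaled dot-product self-attention at the heart of the Transformer: every position attends to every other position in one matrix operation, which is why the computation parallelizes.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head.

    X: (seq_len, d_model) input vectors, one row per token.
    Every token attends to every token; there is no sequential loop.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # new representation for every token

# Toy usage: 5 tokens, d_model=8, d_k=d_v=4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = self_attention(X,
                     rng.normal(size=(8, 4)),   # W_q
                     rng.normal(size=(8, 4)),   # W_k
                     rng.normal(size=(8, 4)))   # W_v
print(out.shape)   # (5, 4)
```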
High-level architecture
Encoder Block
• All encoder blocks are identical in structure (but they do not share weights).
• Each block is broken down into two sub-layers: a self-attention layer followed by a feed-forward layer.
• Input to the first block: Word → Embedding → + Positional Embedding → final input vector (the word's representation with positional context).
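A hedged sketch of that input pipeline (the sinusoidal formula follows the original paper; the vocabulary size, dimensions, and toy token ids are illustrative): each token id is looked up in an embedding table, a positional encoding is added, and the sum is the vector handed to the first encoder block.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Word -> embedding -> + positional encoding -> final input vector
rng = np.random.default_rng(0)
vocab_size, d_model = 100, 16             # illustrative sizes
embedding_table = rng.normal(size=(vocab_size, d_model))
token_ids = np.array([12, 7, 55, 3])      # a toy 4-token sentence
x = embedding_table[token_ids] + positional_encoding(len(token_ids), d_model)
print(x.shape)                            # (4, 16): one position-aware vector per word
```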
• There are dependencies between positions in the self-attention layer.
• There are no dependencies between positions in the feed-forward layer, so each position is processed independently (see the sketch below).
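A brief NumPy sketch (illustrative dimensions) of why the feed-forward sub-layer has no cross-position dependencies: the same two linear layers are applied to each position on its own, so all positions can be computed in parallel.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer MLP to every position independently.
    X: (seq_len, d_model). No token looks at any other token here."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2    # ReLU between the two layers

# Toy usage: 5 tokens, d_model=16, inner dimension d_ff=32 (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
out = position_wise_ffn(X,
                        rng.normal(size=(16, 32)), np.zeros(32),
                        rng.normal(size=(32, 16)), np.zeros(16))
print(out.shape)   # (5, 16), computed per position
```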
Decoder Block
Complete architecture of the transformer
Transformers, GPT-2, and BERT
A transformer uses an Encoder stack to model the input and a Decoder stack to model the output (using input information from the encoder side).

But if we have no input and just want to model the "next word", we can drop the Encoder side of the transformer and emit the "next word" one token at a time. This gives us GPT.

If we are only interested in training a language model of the input for some other task, then we do not need the Decoder of the transformer; that gives us BERT.
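A hedged sketch of this decoder-only vs. encoder-only split using the Hugging Face `transformers` library (assumed installed along with PyTorch; "gpt2" and "bert-base-uncased" are the standard public checkpoints): GPT-2 generates the next word token by token, while BERT only encodes the input for downstream tasks.

```python
# Requires: pip install transformers torch  (assumed available)
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, BertModel, BertTokenizerFast

# Decoder-only: GPT-2 models the "next word", emitted one token at a time.
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")       # smallest GPT-2 checkpoint
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
ids = gpt2_tok("Transformers are", return_tensors="pt").input_ids
out = gpt2.generate(ids, max_new_tokens=10, do_sample=False)
print(gpt2_tok.decode(out[0]))

# Encoder-only: BERT produces a representation of the input for other tasks.
bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
enc = bert_tok("Transformers are powerful.", return_tensors="pt")
hidden = bert(**enc).last_hidden_state
print(hidden.shape)   # (1, num_tokens, 768) hidden vectors, one per input token
```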
GPT-2, BERT
GPT-2 model sizes: 117M, 345M, 762M, and 1542M parameters.
GPT: released June 2018
GPT-2: full 1.5B-parameter model released Nov. 2019
GPT-3: 175B parameters, trained on 45 TB of text
BERT (Bidirectional Encoder Representations from Transformers)
• Model input dimension 512
• Input and output vector size
GPT-3
• GPT-3, OpenAI's third-generation Generative Pretrained Transformer, is a general-purpose language model that uses machine learning to translate text, answer questions, and write text predictively.
• It analyzes a series of terms, text, and other data, then elaborates on those examples to generate fully original output in the form of an article or an image.
Working of GPT-3
• GPT-3 has 175 billion learnable parameters, which allow it to perform almost any task assigned to it. That makes it roughly ten times larger than the next-largest language model at the time, Microsoft Corp.'s Turing-NLG, which has 17 billion parameters.
Source: OpenAI