3. Background
RNN and LSTM techniques have been the state of the art in sequence modeling.
Their sequential nature prevents parallelization, especially when
the sequence length is long (see the sketch below).
These earlier techniques suffer from:
Poor parallelization
Difficulty in learning dependencies between distant positions
High computational cost
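To make the bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass (illustrative NumPy code, not from the paper; names and shapes are assumptions): each hidden state depends on the previous one, so the loop over time steps cannot be parallelized.

import numpy as np

def rnn_forward(x, W_h, W_x, b):
    # Vanilla RNN over a sequence x of shape (seq_len, d_in).
    # The hidden state at step t depends on step t-1, so the
    # time loop is strictly sequential.
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:                      # cannot run steps in parallel
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)            # (seq_len, d_hidden)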
4. Motivation
The Transformer architecture targets the problem of sequence
transduction, aiming for a single framework that can handle as
many kinds of sequence tasks as possible.
It reduces the number of sequential operations needed to relate
any two symbols from the input/output sequences to a constant
number of operations.
5. Motivation
The Transformer achieves parallelization by
replacing recurrence with attention and
encoding each symbol's position in the
sequence (sketched below), which in turn
leads to significantly shorter training time.
It eliminates not only recurrence but
also convolution in favor of self-attention,
providing even more room for parallelization.
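The position encoding mentioned above uses the sinusoidal form given in the paper; below is a minimal sketch (the function name and shapes are illustrative, and d_model is assumed even).

import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from the paper:
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe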
8. Self-Attention
Self-attention is the mechanism
used by Transformers to
associate one word with another.
Example: "I am a student"
In this sentence, self-attention
allows the model to relate "I" to
"student".
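A minimal sketch of that mechanism, using the scaled dot-product attention defined in the paper (the NumPy code, variable names, and shapes are illustrative, and the projection weights are assumed to be given):

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) embeddings, e.g. for "I am a student".
    # Every position attends to every other position directly,
    # so "I" can be related to "student" in a single step.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (seq_len, d_v)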
21. Training
The objective was the translation task, for English-to-German
and English-to-French translation.
Sentence pairs: 4.5M (EN-DE), 36M (EN-FR); vocabulary: ~37K tokens (EN-DE), 32K tokens (EN-FR)
Sentence pairs were batched together by approximate
sequence length.
The base model was trained for 100,000 steps and the big models
for 300,000 steps.
Varied learning rate (warmup then decay, see the sketch below);
the Adam optimizer parameters are nearly the same as the default ones.
Regularization: residual dropout (applied in two places) and label smoothing.
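The varied learning rate follows the warmup-then-decay schedule given in the paper (warmup_steps = 4000 in these runs); a minimal sketch, with the function name being illustrative:

def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    # Linear warmup for warmup_steps, then decay proportional to 1/sqrt(step).
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)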
22. Results
For the English-German translation task, the big
Transformer model outperformed every other model,
establishing a new state-of-the-art BLEU score.
The base models surpassed previously published models and
ensembles at a fraction of their training cost.
For the English-French translation task, the big model
outperformed previous models at less than 1/4 of their
training cost, with dropout 0.1 (rather than 0.3).
Inference hyperparameters, beam size (4) and length penalty
(0.6), were chosen after experimenting on the development
set (see the sketch below).
The maximum output length during inference is input length +
50, terminating early when possible.
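The beam-search ranking with length penalty 0.6 can be sketched as follows; the paper only states the penalty value, so the GNMT-style ((5 + length) / 6)^alpha form used here is an assumption, as are the function names.

def length_penalty(length, alpha=0.6):
    # GNMT-style length normalization (assumed form; the paper
    # only gives the penalty value alpha = 0.6).
    return ((5 + length) / 6) ** alpha

def beam_score(sum_log_prob, length, alpha=0.6):
    # Rank finished hypotheses so that longer outputs are not
    # unfairly penalized for accumulating more log-probability terms.
    return sum_log_prob / length_penalty(length, alpha)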
23. Model Variations, Performance
Varied the base model to evaluate the importance of different components.
Single-head attention is 0.9 BLEU worse than the best setting; quality also drops off with too many heads (see the sketch below).
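One way to see the head-count trade-off: with h heads, each head attends in a d_model / h dimensional subspace. A minimal reshaping sketch (illustrative NumPy code, not the paper's implementation):

import numpy as np

def split_heads(X, num_heads):
    # (seq_len, d_model) -> (num_heads, seq_len, d_k) with
    # d_k = d_model / num_heads: more heads means each head works
    # in a smaller subspace, which helps explain why quality
    # eventually drops off as the head count grows.
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    return X.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)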
24. References
[1] Vaswani et al., "Attention Is All You Need". https://arxiv.org/abs/1706.03762
[2] Jay Alammar, "The Illustrated Transformer". https://jalammar.github.io/illustrated-transformer/
[3] Michael Phi, "Illustrated Guide to Transformers - Step by Step Explanation". https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0