240115_Attention Is All You Need (2017 NIPS).pptx
Min-Seo Kim
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: kms39273@naver.com
Previous work
RNN (Recurrent Neural Network)
• Uses the RNN structure, which is well suited to processing sequence data and time-series data.
• An RNN feeds past information into the current decision, which lets it capture the continuity and context of data over time (a minimal sketch follows below).
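A minimal NumPy sketch of the vanilla RNN recurrence; illustrative only, so the function name, weights, and dimensions are assumptions, not taken from the slides:

```python
import numpy as np

# One vanilla RNN step: the new hidden state h_t mixes the current input x_t
# with the previous state h_{t-1}, so past information flows into the
# current decision.
def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(seq_len, input_dim)):  # process the sequence in order
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (8,) -- the final state summarizes the whole sequence
```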
Previous work
LSTM (Long Short-Term Memory)
• LSTM emerged as a solution to the long-term dependency problem: in a vanilla RNN, information from early time steps is not transmitted well to later steps as the sequence grows longer (see the gate sketch below).
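A sketch of a single LSTM step; the weight dictionaries `W` and `b` and all sizes are illustrative assumptions. It shows the four internal networks and the additive cell-state path that carries information across many steps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step: four small networks (forget, input, candidate, output) act on
# [h_{t-1}, x_t]; the mostly additive cell-state update lets old information
# survive across long sequences, easing the long-term dependency problem.
def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])   # forget gate
    i = sigmoid(W["i"] @ z + b["i"])   # input gate
    g = np.tanh(W["g"] @ z + b["g"])   # candidate cell state
    o = sigmoid(W["o"] @ z + b["o"])   # output gate
    c = f * c_prev + i * g             # additive path for old information
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = {k: rng.normal(size=(d_h, d_h + d_in)) * 0.1 for k in "figo"}
b = {k: np.zeros(d_h) for k in "figo"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```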
Previous work
GRU (Gated Recurrent Unit)
• LSTM demands considerable compute because each cell contains four neural networks; GRU emerged as an improvement that implements a similar gating mechanism with only three (sketched below).
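For comparison, a sketch of one GRU step under the same illustrative conventions as the LSTM sketch above: three networks instead of four, and no separate cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step: only three networks (reset, update, candidate) implement a
# gating mechanism similar to the LSTM's.
def gru_step(x_t, h_prev, W, b):
    z_in = np.concatenate([h_prev, x_t])
    r = sigmoid(W["r"] @ z_in + b["r"])   # reset gate
    u = sigmoid(W["u"] @ z_in + b["u"])   # update gate
    g = np.tanh(W["g"] @ np.concatenate([r * h_prev, x_t]) + b["g"])  # candidate
    return (1.0 - u) * h_prev + u * g     # interpolate between old and new state
```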
Background
Problem with the Encoder-Decoder Model
• To address the bottleneck caused by compressing the whole input into a single, fixed-size context vector, machine translation research moved beyond the RNN-based encoder-decoder framework (the bottleneck is sketched below).
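A toy illustration of that bottleneck; all names and sizes are made up. The classic encoder compresses any source sequence, short or long, into one fixed-size context vector:

```python
import numpy as np

# Sketch of the bottleneck in the basic encoder-decoder: however long the
# source sequence is, the decoder only ever sees ONE fixed-size vector c.
rng = np.random.default_rng(1)
d = 8
W_xh = rng.normal(size=(d, d)) * 0.1
W_hh = rng.normal(size=(d, d)) * 0.1

def encode(xs):
    h = np.zeros(d)
    for x in xs:                      # read the whole source sequence...
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h                          # ...and return a single context vector c

c_short = encode(rng.normal(size=(3, d)))
c_long = encode(rng.normal(size=(300, d)))
print(c_short.shape, c_long.shape)    # (8,) (8,) -- same capacity either way
```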
Methodology
• Does not use networks that model sequence order directly, such as RNNs or CNNs.
• Instead, positional encoding injects position information, and self-attention captures context (see the sketch below).
[Figure: Transformer encoder-decoder architecture]
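A sketch of the paper's sinusoidal positional encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the helper name and the demo shapes are our own:

```python
import numpy as np

# Sinusoidal positional encoding from the paper: each position gets a unique
# pattern of sines and cosines at geometrically spaced frequencies.
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512) -- added to token embeddings before the first layer
```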
Baseline
Multi-Head Attention
• More effective than a single attention function: queries, keys, and values are linearly projected to multiple lower-dimensional representations, and attention is computed in parallel on each projection, giving several attention heads with different inputs (a sketch follows below).
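A compact NumPy sketch of that mechanism, following the paper's scaled dot-product formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The self-attention case (a single input `x`) and all weight names are illustrative assumptions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, computed per head with a stable softmax
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                      # (h, n, d_head)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = x.shape
    d_head = d_model // n_heads

    # Linearly project to Q, K, V, then split into n_heads subspaces.
    def split(W):
        return (x @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(W_q), split(W_k), split(W_v)
    heads = scaled_dot_product_attention(Q, K, V)           # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads
    return concat @ W_o                                     # final projection

rng = np.random.default_rng(0)
n, d_model, n_heads = 10, 64, 8
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
x = rng.normal(size=(n, d_model))
print(multi_head_attention(x, *Ws, n_heads=n_heads).shape)  # (10, 64)
```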
Experiments
English-to-German translation task (WMT 2014)
• Translation quality is measured by how similar the machine-translated output is to human reference translations (the BLEU score; a minimal example follows below).
• The Transformer achieves higher performance than competing models while also incurring a lower training cost.
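A minimal illustration of BLEU scoring with NLTK (assumes `nltk` is installed; the sentences here are made-up examples, not from the paper):

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = ["the cat sits on the mat".split()]   # human reference translation(s)
candidate = "the cat sat on the mat".split()      # machine-translated output
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # higher means closer to the human reference
```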
Experiments
English Constituency Parsing
• To test the Transformer's effectiveness in other tasks, it was also applied to the English constituency parsing task.
• Constituency parsing organizes the words of a sentence into a hierarchy of grammatical constituents (phrases).
• Despite not being specifically tuned for this task, the Transformer demonstrates good performance.
Paper review
Conclusions
• The Transformer replaces the recurrent layers commonly used in encoder-decoder architectures with multi-
headed self-attention.
• For translation tasks, the Transformer can be trained much faster than architectures based on recurrent or
convolutional layers.
• The authors are excited about the future of attention-based models and plan to apply them to other tasks.
• Plans include extending the Transformer to handle input and output modalities beyond text, and exploring
local, restricted attention mechanisms to efficiently process large inputs and outputs such as images, audio,
and video.
• Another research goal is to make the generation process less sequential.
Editor's Notes
RNN-based encoder input:
Y_{t-1}: the previous word
S_t: hidden state
C: context vector
RNNencdec-30: baseline without attention
Search: model with attention applied