Attention is all you need
Whi Kwon
Introduction
2008 ~ 2015: Chemical and Biological Engineering
2015 ~ 2017: Quality / customer support engineer
2017 ~ 2018: Independent study of deep learning
2018 ~: Startup in the medical field
Interests
~2017.12: Vision, NLP
~2018.06: RL, GAN
2018.06~: Relational, Imitation
Outline
Part 1. Attention
Part 2. Self-Attention
Part 1. Attention
Attention, also referred to as enthrallment, is the behavioral and cognitive process
of selectively concentrating on a discrete aspect of information, whether deemed
subjective or objective, while ignoring other perceivable information. It is a state of
arousal. . It is the taking possession by the mind in clear and vivid form of one out
of what seem several simultaneous objects or trains of thought. Focalization, the
concentration of consciousness, is of its essence. Attention or enthrallment or
attention has also been described as the allocation of limited cognitive processing
resources.
Recurrent Neural Network
(Figure: an RNN reading the passage above one token at a time.)
Problem: non-parallel computation, no long-range dependencies
Convolutional Neural Network
(Figure: a convolutional filter sliding over the passage above, seeing only a local window of tokens at a time.)
Problem: no long-range dependencies, computationally inefficient
Attention mechanism
Parallel computation, long-range dependencies, explainable
Attention mechanism
Fig. from Vaswani et al. Attention is all you need. ArXiv. 2017
1. Compute the similarity between Q and K.
2. Normalize so that overly large values do not dominate (divide by √d_k).
3. Turn the similarities into weights that sum to 1 (softmax).
4. Multiply the weights by V.
The information {K: V} will be related to some query Q. Using this, compute the similarity between K and Q and apply it to V. That way, the information in V that is directly relevant to Q is passed along with more weight.
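The four steps above can be written out directly. Below is a minimal NumPy sketch of the scaled dot-product attention drawn in the Vaswani et al. figure; the array shapes are illustrative assumptions, not values from the slides.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T                 # 1. similarity between Q and K
    scores = scores / np.sqrt(d_k)   # 2. normalize so large values do not dominate
    weights = softmax(scores)        # 3. similarities -> weights that sum to 1
    return weights @ V, weights      # 4. multiply the weights by V and sum

# Toy usage with assumed shapes.
Q = np.random.randn(2, 4)   # 2 queries of dimension 4
K = np.random.randn(5, 4)   # 5 keys
V = np.random.randn(5, 8)   # 5 values of dimension 8
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))   # (2, 8); each row of weights sums to 1
```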
e.g. Attention mechanism with Seq2Seq
(Machine translation, Encoder-Decoder, Attention)
(Figure: a plain encoder-decoder RNN.)
The encoder passes information forward using the previous timestep's hidden state and the current timestep's input.
Only the encoder's final state is handed to the decoder.
The decoder's information flow depends only on the previous timestep.
(Figure: the same model with attention (⊕) — at every decoding step the decoder attends over all encoder hidden states, giving a direct long-range dependency path.)
Fig. from Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. 2015
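As a hedged sketch of how a decoder uses attention over the encoder states (following the additive-attention idea in Bahdanau et al.; the weight matrices and dimensions here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(dec_hidden, enc_states, W_dec, W_enc, v):
    """One decoding step: score every encoder state against the decoder state,
    turn the scores into weights, and return the weighted sum (context vector).
    dec_hidden: (d_dec,), enc_states: (T, d_enc)."""
    scores = np.array([
        v @ np.tanh(W_dec @ dec_hidden + W_enc @ h)   # additive (Bahdanau-style) score
        for h in enc_states
    ])
    weights = softmax(scores)        # attention weights over the source positions
    context = weights @ enc_states   # context vector fed to the decoder at this step
    return context, weights

# Toy usage with assumed dimensions.
T, d_enc, d_dec, d_att = 6, 8, 8, 16
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(T, d_enc))
dec_hidden = rng.normal(size=d_dec)
W_dec = rng.normal(size=(d_att, d_dec))
W_enc = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)
ctx, w = attention_context(dec_hidden, enc_states, W_dec, W_enc, v)
print(ctx.shape, w.round(2))   # context vector plus one weight per source position
```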
e.g. Style-token
(Text to speech, Encoder-Decoder, Style transfer, Attention)
Fig. from Wang et al. Style-tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. ArXiv. 2018
(Figure: two encoders feed the decoder; attention over a bank of randomly initialized global style tokens (GST) produces a style embedding that is combined (⊕) with the text encoding.)
Demo: https://google.github.io/tacotron/publications/global_style_tokens/
Part 2. Self-attention
Self-attention
(Figure: a toy self-attention layer on a 3×3 grid of values. For output position 1, a weight is computed for every input position, each input value is multiplied by its weight, and the products are summed (⊕) to give 1'. Doing the same for position 2 gives 2', and so on for every position of the output grid 1'…9'.)
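A minimal sketch of the toy example above: flatten the grid, weight every position against every other position, and take weighted sums. Using a plain dot product over small feature vectors as the similarity is an illustrative assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_grid(features):
    """features: (H, W, C) grid. Every output position is a weighted sum of
    all input positions, weighted by pairwise similarity."""
    H, W, C = features.shape
    x = features.reshape(H * W, C)    # flatten the grid: (9, C) for a 3x3 grid
    weights = softmax(x @ x.T)        # (9, 9) weights; each row sums to 1
    out = weights @ x                 # 1', 2', ..., 9' as weighted sums
    return out.reshape(H, W, C), weights

grid = np.random.randn(3, 3, 4)       # 3x3 grid with 4 features per position
out, w = self_attention_grid(grid)
print(out.shape, w[0].round(2))       # weights used to build output position 1'
```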
Self-attention
Fig. from Wang et al. Non-local neural networks. ArXiv. 2017.
1. Compute the similarity between pixels i and j.
2. Multiply by the value of pixel j.
3. Normalization term.
The information at positions i and j will be related. Compute a similarity for every pair of positions and use it as a weight; the relations between all positions can then be learned. (Long-range dependency!)
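Written out, the three steps above are the non-local operation defined in Wang et al.:

$$ y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) $$

Here f(x_i, x_j) is the pairwise similarity (step 1), g(x_j) is the value at position j (step 2), and C(x) is the normalization term (step 3). With the embedded-Gaussian choice f(x_i, x_j) = exp(θ(x_i)ᵀ φ(x_j)) and C(x) = Σ_j f(x_i, x_j), the weights become a softmax over j, which matches the attention mechanism from Part 1.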
e.g. Self-Attention GAN
(Image generation, GAN, Self-attention)
Fig. from Zhang et al. Self-Attention Generative Adversarial Networks. ArXiv. 2018.
(Figure: Generator — latent z → transposed-convolution blocks with a self-attention layer inserted (⊕) → generated image x'. Discriminator — image x → convolution blocks with a self-attention layer inserted (⊕) → FC → real/fake probability.)
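A hedged sketch of where such a layer sits in a convolutional stack: self-attention is applied over the flattened feature map and added back to the input (the ⊕ in the figure). SAGAN additionally uses 1×1 convolutions and a learned scale on the attention output; this sketch omits those details and reuses plain dot-product similarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_block(feat):
    """feat: (H, W, C) conv feature map. Returns feat + attention(feat)."""
    H, W, C = feat.shape
    x = feat.reshape(H * W, C)                   # flatten spatial positions
    attn = softmax((x @ x.T) / np.sqrt(C)) @ x   # attention over all positions
    return feat + attn.reshape(H, W, C)          # residual add (the ⊕ in the figure)

# Toy usage: one stage of a generator/discriminator where the feature map
# passes through the block and keeps its shape.
feat = np.random.randn(8, 8, 16)
out = self_attention_block(feat)
print(out.shape)   # (8, 8, 16): same shape, now with long-range spatial context
```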
Conclusion
Attention: weight the values V by the similarity between Q and K — parallel computation, long-range dependencies, explainable.
Self-Attention: apply the same mechanism within a single input (a sequence or a feature map) to learn the relations between all positions.
Next...?
Relational Network, Graphical Model...
Reference
- Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. 2015
- Wang et al. Non-local neural networks. ArXiv. 2017
- Vaswani et al. Attention is all you need. ArXiv. 2017
- Wang et al. Style-tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End
Speech Synthesis. ArXiv. 2018
- Zhang et al. Self-Attention Generative Adversarial Networks. ArXiv. 2018.
- Blog post explaining Attention is all you need
(https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/)
- Video explaining Attention is all you need
(https://www.youtube.com/watch?v=iDulhoQ2pro)
