[Paper Reading] Attention is All You Need, by Daiki Tanaka
The document summarizes the "Attention Is All You Need" paper, which introduced the Transformer model for natural language processing. The Transformer uses attention mechanisms rather than recurrent or convolutional layers, allowing for more parallelization. It achieved state-of-the-art results in machine translation tasks using techniques like multi-head attention, positional encoding, and beam search decoding. The paper demonstrated the Transformer's ability to draw global dependencies between input and output with a constant number of sequential operations between any two positions.
Slides reviewing the paper:
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 6000-6010. 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
The document discusses different types of attention mechanisms used in neural machine translation and image captioning models. It describes global attention which considers all encoder hidden states when deriving context vectors, and local attention which selectively focuses on a small window of context. Hard attention selects a single location to focus on, while soft attention takes a weighted average over locations. The document also discusses input feeding which makes the model aware of previous alignment choices.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Generative models are now a very good tool for anomaly detection. I therefore present an interesting generative model, 'Diffusion', for solving the anomaly detection task. The presentation covers the concept of diffusion and a method for using diffusion for anomaly detection.
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene..., by 홍배 김
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets
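Material prepared by Takato Horii, a doctoral student at Osaka University.
Using GANs, which excel as data generation models, to "simply" acquire through unsupervised learning
features that represent the information in an image in an "easy-to-understand" way
* The same principle as transforming from a physical space, where features are entangled with one another, into an eigen space, where they are mutually independent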
1) The document discusses different types of attention mechanisms in CNNs including self-attention and simplified attention for recalibration.
2) It reviews the evolution of CNN architectures including AlexNet, VGG, ResNet and variants, DenseNet, ResNeXt, Xception, MobileNet and ShuffleNet.
3) These attention mechanisms and CNN architectures are applied to tasks like image recognition, machine translation and image captioning.
This document provides an overview of deep learning and neural networks. It begins with definitions of machine learning, artificial intelligence, and the different types of machine learning problems. It then introduces deep learning, explaining that it uses neural networks with multiple layers to learn representations of data. The document discusses why deep learning works better than traditional machine learning for complex problems. It covers key concepts like activation functions, gradient descent, backpropagation, and overfitting. It also provides examples of applications of deep learning and popular deep learning frameworks like TensorFlow. Overall, the document gives a high-level introduction to deep learning concepts and techniques.
http://imatge-upc.github.io/telecombcn-2016-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or text captioning.
The document summarizes the Transformer neural network model proposed in the paper "Attention is All You Need". The Transformer uses self-attention mechanisms rather than recurrent or convolutional layers. It achieves state-of-the-art results in machine translation by allowing the model to jointly attend to information from different representation subspaces. The key components of the Transformer include multi-head self-attention layers in the encoder and masked multi-head self-attention layers in the decoder. Self-attention allows the model to learn long-range dependencies in sequence data more effectively than RNNs.
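[Korean] Neural Architecture Search with Reinforcement Learning, by Kiho Suh
Sharing the slides from my presentation at 모두의연구소 on the paper “Neural Architecture Search with Reinforcement Learning”. The paper shows clearly what Google's AutoML, which automates part of machine learning development work, is trying to do.
The paper describes a deep learning architecture that builds deep learning architectures. It used 800 GPUs, or alternatively 400 CPUs, and produced networks that are state of the art, or just below state of the art but faster and smaller. The paradigm has now shifted from feature engineering to neural network engineering, and this is the first paper to attempt it.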
This document discusses neural network models for natural language processing tasks like machine translation. It describes how recurrent neural networks (RNNs) were used initially but had limitations in capturing long-term dependencies and parallelization. The encoder-decoder framework addressed some issues but still lost context. Attention mechanisms allowed focusing on relevant parts of the input and using all encoded states. Transformers replaced RNNs entirely with self-attention and encoder-decoder attention, allowing parallelization while generating a richer representation capturing word relationships. This revolutionized NLP tasks like machine translation.
This document describes DenseNets, a type of convolutional neural network architecture. DenseNets connect each layer to every other layer in a feed-forward fashion to encourage feature reuse and consolidate feature maps early in the network. This architecture improves information and gradient flow. The document outlines key DenseNet concepts like collective knowledge, compression layers, and growth rate. It also provides results comparing DenseNets to ResNet on CIFAR-10 and ImageNet datasets.
Introduction For seq2seq(sequence to sequence) and RNN, by Hye-min Ahn
These are my slides introducing the sequence-to-sequence model and the Recurrent Neural Network (RNN) to my laboratory colleagues.
Hyemin Ahn, @CPSLAB, Seoul National University (SNU)
Conditional Image Generation with PixelCNN Decoders, by suga93
The document summarizes research on conditional image generation using PixelCNN decoders. It discusses how PixelCNNs sequentially predict pixel values rather than the whole image at once. Previous work used PixelRNNs, but these were slow to train. The proposed approach uses a Gated PixelCNN that removes blind spots in the receptive field by combining horizontal and vertical feature maps. It also conditions PixelCNN layers on class labels or embeddings to generate conditional images. Experimental results show the Gated PixelCNN outperforms PixelCNN and achieves performance close to PixelRNN on CIFAR-10 and ImageNet, while training faster. It can also generate portraits conditioned on embeddings of people.
1. Recurrent neural networks can model sequential data like time series by incorporating hidden state that has internal dynamics. This allows the model to store information for long periods of time.
2. Two key types of recurrent networks are linear dynamical systems and hidden Markov models. Long short-term memory networks were developed to address the problem of exploding or vanishing gradients in training traditional recurrent networks.
3. Recurrent networks can learn tasks like binary addition by recognizing patterns in the inputs over time rather than relying on fixed architectures like feedforward networks. They have been successfully applied to handwriting recognition.
https://telecombcn-dl.github.io/2018-dlai/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks or Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles of deep learning from both algorithmic and computational perspectives.
The document describes the sequence-to-sequence (seq2seq) model with an encoder-decoder architecture. It explains that the seq2seq model uses two recurrent neural networks - an encoder RNN that processes the input sequence into a fixed-length context vector, and a decoder RNN that generates the output sequence from the context vector. It provides details on how the encoder, decoder, and training process work in the seq2seq model.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/09/introduction-to-dnn-model-compression-techniques-a-presentation-from-xailient/
Sabina Pokhrel, Customer Success AI Engineer at Xailient, presents the “Introduction to DNN Model Compression Techniques” tutorial at the May 2021 Embedded Vision Summit.
Embedding real-time large-scale deep learning vision applications at the edge is challenging due to their huge computational, memory, and bandwidth requirements. System architects can mitigate these demands by modifying deep-neural networks to make them more energy efficient and less demanding of processing resources by applying various model compression approaches.
In this talk, Pokhrel provides an introduction to four established techniques for model compression. She discusses network pruning, quantization, knowledge distillation and low-rank factorization compression approaches.
These are materials prepared for my lab's study group on the "Transformer", which is the base of recent NLP x Deep Learning research. I tried to be accurate, including in citing the reference materials, but please point out any errors you find.
Deep learning and neural networks are inspired by biological neurons. Artificial neural networks (ANN) can have multiple layers and learn through backpropagation. Deep neural networks with multiple hidden layers did not work well until recent developments in unsupervised pre-training of layers. Experiments on MNIST digit recognition and NORB object recognition datasets showed deep belief networks and deep Boltzmann machines outperform other models. Deep learning is now widely used for applications like computer vision, natural language processing, and information retrieval.
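These are presentation slides I made while studying for an in-house study group. There may be shortcomings, so please let me know and I will correct them.
* In the classical CNN architecture shown on slide 6 (it appears again on later slides), the second ReLU in "ReLU - Pool - ReLU" is a mistake, because computing ReLU again after ReLU - Pool is redundant (thanks to Kyung Mo Kweon for the feedback).
[paper review] 손규빈 - Eye in the sky & 3D human pose estimation in video with ..., by Gyubin Son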
1. Eye in the Sky: Real-time Drone Surveillance System (DSS) for Violent Individuals Identification using ScatterNet Hybrid Deep Learning Network
https://arxiv.org/abs/1806.00746
2. 3D human pose estimation in video with temporal convolutions and semi-supervised training
https://arxiv.org/abs/1811.11742
These slides were presented after winning a prize in the NLP challenge run by Naver.
[Competition link] https://github.com/naver/nlp-challenge
-------------------------------------------------------------------------------
이신의 (lsnfamily02@yonsei.ac.kr)
박장원 (adieujw@gmail.com)
Slides based on "Introduction to Machine Learning with Python" by Andreas Muller and Sarah Guido (Korean edition "파이썬 라이브러리를 활용한 머신러닝", translated by 박해선) for Hongdae Machine Learning Study (https://www.meetup.com/Hongdae-Machine-Learning-Study/) (epoch #2)
1. Attention Is All You Need
Ashish Vaswani et al. (2017)
이준호, 한국 인공지능 연구소
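2. • A new encoder-decoder approach that departs from the RNN / CNN approaches used previously: it stacks N modules built on feed-forward networks (FNN) and skip connections, and handles long-term dependencies effectively without architectures such as LSTM or GRU
• The distance between a source word and the output is short, so attention connects the information directly
• Overcomes the drawback that RNNs are hard to parallelize because of their sequential nature, improving training speed by roughly 30x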
3. Motivation (Google Research Blog)
• “I arrived at the bank after crossing the …”
• I arrived at the bank after crossing the road (bank: a financial institution)
• I arrived at the bank after crossing the river (bank: a riverbank)
An RNN processes language sequentially, so the number of time steps needed for analysis grows as the word distance grows.
Recent CNN-based language models likewise need more steps (layers) as the word distance grows.
5. Model Architecture
• Left side: Encoder
• Right side: Decoder
• The model is auto-regressive at each step
• That is, the symbols produced at previous steps are fed in as additional input at the next step.
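The auto-regressive loop described on this slide can be sketched as follows. This is only an illustration under stated assumptions: `greedy_decode`, `toy_model`, and the `<s>` / `</s>` markers are hypothetical names that do not come from the slides, and `toy_model` simply replays a fixed sentence in place of a trained decoder.

```python
from typing import Callable, List


def greedy_decode(next_token: Callable[[List[str]], str],
                  start: str = "<s>", end: str = "</s>",
                  max_len: int = 10) -> List[str]:
    """Generate tokens one at a time, feeding each output back in as input."""
    output = [start]
    for _ in range(max_len):
        token = next_token(output)  # the model sees everything generated so far
        output.append(token)
        if token == end:            # stop once the end marker is produced
            break
    return output


# Hypothetical stand-in for a trained decoder: it replays a fixed sentence.
def toy_model(prefix: List[str]) -> str:
    sentence = ["I", "arrived", "at", "the", "bank", "</s>"]
    return sentence[len(prefix) - 1]


decoded = greedy_decode(toy_model)
print(decoded)
```

Each call to `next_token` receives the whole prefix, which is why generation is sequential at inference time even though training can be parallelized.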
6. Encoder
• The encoder has N = 6 identical layers.
• Each layer has 2 sub-layers.
• The first is a multi-head self-attention mechanism
• It is followed by a fully connected feed-forward network.
• The base dimension used throughout the model is taken to be d_model = 512
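A rough sketch of how the two sub-layers and the skip connections fit together in a stack of N = 6 layers. It assumes the paper's wrapper LayerNorm(x + Sublayer(x)) around each sub-layer; the slides only mention skip connections. The layer norm below omits the learned gain and bias, a single vector stands for one position, and `identity` and `double` are hypothetical placeholders for the real attention and feed-forward sub-layers.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer_connection(x: Vector, sublayer: Sublayer) -> Vector:
    """Skip connection followed by layer norm: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def encoder(x: Vector, layers: List[Tuple[Sublayer, Sublayer]]) -> Vector:
    """Apply each layer's two sub-layers in order: attention, then feed-forward."""
    for self_attention, feed_forward in layers:
        x = sublayer_connection(x, self_attention)
        x = sublayer_connection(x, feed_forward)
    return x


# Hypothetical placeholder sub-layers, used only to exercise the structure.
identity: Sublayer = lambda v: v
double: Sublayer = lambda v: [2.0 * a for a in v]

out = encoder([1.0, 2.0, 3.0, 4.0], [(identity, double)] * 6)
print(out)
```

Because every sub-layer output is added back to its input, each sub-layer has to produce vectors of the same size, which is why a single d_model is used throughout.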
7. Decoder
• Like the encoder, it has N = 6 identical layers
• Each decoder layer has 3 sub-layers.
• It is similar to the encoder, but for input feeding the decoder's final output is fed back in as its input at the next step.
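8. • Input Embedding
pre-softmax linear transformation
text probability vector
• Positional Encoding
- Basic requirement: dimension d_model = 512, matching the embeddings so that the two can be summed
- Uses sine and cosine functions
pos is the position
i is the dimension
- Because the model uses neither an RNN nor a CNN, the notion of time step (token order) has to be preserved some other way.
- With the sinusoidal method, the model can 'extrapolate' to longer sequences at inference time, and for any offset k, PE_{pos + k} can reportedly be written in the form PE_{pos + k} = a * PE_{pos} + b, that is, as a linear function of PE_{pos}. I found this part hard to follow.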
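The sinusoidal encoding and the offset property can be checked numerically. The sketch below assumes the formulas from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function names `positional_encoding` and `shift` are hypothetical. With these formulas the map from PE_{pos} to PE_{pos + k} is a 2x2 rotation applied to each (sin, cos) pair, so it depends only on k and the offset term b is zero.

```python
import math
from typing import List


def positional_encoding(pos: int, d_model: int = 512) -> List[float]:
    """Interleave sin and cos of pos / 10000^(2i/d_model) for each pair index i."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift(pe: List[float], k: int, d_model: int = 512) -> List[float]:
    """Compute PE(pos + k) from PE(pos) alone, one 2x2 rotation per pair."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))  # frequency of this pair
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(k * w) + c * math.sin(k * w))  # sin(a + k*w)
        out.append(c * math.cos(k * w) - s * math.sin(k * w))  # cos(a + k*w)
    return out


pe_10 = positional_encoding(10)
pe_13 = positional_encoding(13)
max_error = max(abs(a - b) for a, b in zip(shift(pe_10, 3), pe_13))
print(max_error)  # tiny: only floating-point rounding remains
```

`shift` never looks at the position itself, only at the encoding and the offset k, which is the sense in which relative positions are easy for the model to pick up.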
9. In encoder-decoder attention: Q: the hidden state from the previous decoder layer; K: the keys from the encoder output; V: the values from the encoder output
Source: http://dalpo0814.tistory.com/category/Machine%20Learning
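10. Scaled Dot Product
Compute the similarity between Q and K
Scale (divide by sqrt(d_k)) so that overly large values do not dominate
• Difference between scale and normalize
Scale: adjust to mean 0, variance 1; mainly used to prevent overflow and underflow
Normalize: fix an overall range and locate a given data point within the data set
- Formula: (value - minimum) / (maximum - minimum)
Similarity weights (sum == 1)
Multiply the weights with V
Summary: the key-value pairs {K, V} of the current state are assumed to be related in some way to Q, the previous state.
So we compute the similarity between K and Q and apply it to V.
As a result, the values whose keys are most similar to Q contribute most to the output.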
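The steps on this slide (similarity of Q and K, scaling, weights that sum to 1, weighted sum of V) can be sketched in plain Python. This assumes the paper's formula softmax(Q K^T / sqrt(d_k)) V and leaves out the optional mask. The toy matrices are made up for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Turn scores into positive weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q: Matrix, K: Matrix) -> Matrix:
    """Dot-product similarity of each query with each key, scaled by sqrt(d_k)."""
    d_k = len(K[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    return [softmax(row) for row in scores]


def scaled_dot_product_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Weighted sum of the value rows, one output row per query."""
    weights = attention_weights(Q, K)
    return [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
            for row in weights]


# Toy example: one query that matches the first key better than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

weights = attention_weights(Q, K)
output = scaled_dot_product_attention(Q, K, V)
print(weights, output)
```

The query is closer to the first key, so the first value row gets the larger weight and dominates the output.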
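11. Multi-Head Attention
• It was found to work better to linearly project the V, K, Q discussed so far h times, pass each projection through Scaled Dot-Product attention, and concatenate the results.
Because the projections reduce the dimension, the amount of computation does not increase significantly.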
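A minimal sketch of the project, attend, concatenate pattern, assuming each head works in d_k = d_model / h dimensions (the paper uses d_model = 512 and h = 8, so d_k = 64). The random matrices stand in for learned projection weights, and the names `multi_head_attention` and `random_matrix` are hypothetical.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain matrix product: rows of A against columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Stand-in for a learned weight matrix (assumption: random values)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(Q: Matrix, K: Matrix, V: Matrix,
                         h: int, rng: random.Random) -> Matrix:
    d_model = len(Q[0])
    d_k = d_model // h  # each head works in a smaller subspace
    heads = []
    for _ in range(h):
        W_q, W_k, W_v = (random_matrix(d_model, d_k, rng) for _ in range(3))
        heads.append(attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v)))
    # Concatenate the h head outputs position by position: back to h * d_k columns.
    concat = [sum((head[t] for head in heads), []) for t in range(len(Q))]
    W_o = random_matrix(h * d_k, d_model, rng)  # final output projection
    return matmul(concat, W_o)


rng = random.Random(0)
x = random_matrix(5, 16, rng)  # 5 positions, d_model = 16
out = multi_head_attention(x, x, x, h=4, rng=rng)
print(len(out), len(out[0]))
```

Since h heads of width d_k add up to h * d_k = d_model, the total cost stays close to that of one full-width attention, which is the point made on the slide.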
14. Closing remarks
• While preparing these study materials from the paper I tried to include as much information as possible, but I did not understand every part 100%, so some of the explanation may fall short.
• The paper itself is available at https://arxiv.org/abs/1706.03762.
• I relied heavily on the Google Research Blog and http://dalpo0814.tistory.com/category/Machine%20Learning.