An introduction to the Transformers architecture and BERT - Suman Debnath
The transformer is one of the most popular state-of-the-art (SOTA) deep learning architectures, used mostly for natural language processing (NLP) tasks. Since its advent, the transformer has replaced RNNs and LSTMs for many tasks. It also marked a major breakthrough in NLP and paved the way for revolutionary new architectures such as BERT.
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Survey of Attention mechanism & Use in Computer Vision - SwatiNarkhede1
This presentation contains an overview of attention models. It also covers the stand-alone self-attention model used for computer vision tasks.
Slides reviewing the paper:
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 6000-6010. 2017.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
This is material prepared for a lab seminar about the "Transformer", the architecture underlying much of recent NLP x Deep Learning research. Citations of the reference materials are intended to be accurate; please point out any errors.
PR-409: Denoising Diffusion Probabilistic Models - Hyeongmin Lee
This paper is Denoising Diffusion Probabilistic Models (DDPM), the work that first popularized the currently trending diffusion models. It elegantly resolved several practical issues of diffusion, originally proposed at ICML 2015, and kicked off the current wave. We will look at the different branches of generative models, diffusion itself, and what DDPM changed.
Paper link: https://arxiv.org/abs/2006.11239
Video link: https://youtu.be/1j0W_lu55nc
Introduction For seq2seq (sequence to sequence) and RNN - Hye-min Ahn
These are my slides introducing the sequence-to-sequence model and the Recurrent Neural Network (RNN) to my laboratory colleagues.
Hyemin Ahn, @CPSLAB, Seoul National University (SNU)
Attention Mechanism in Language Understanding and its Applications - Artifacia
This is the presentation from our AI Meet March 2017 on Attention Mechanism in Language Understanding and its Applications.
You can join Artifacia AI Meet Bangalore Group: https://www.meetup.com/Artifacia-AI-Meet/
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Week 9: The neural basis of consciousness: dissociation of consciousness &... - Nao (Naotsugu) Tsuchiya
A 12-week lecture series on "the neural basis of consciousness" by Prof Nao Tsuchiya, given at third-year undergraduate level. No prerequisites.
Contents:
1) What are the logic and evidence of experiments which demonstrate dissociation between attention and consciousness?
2) How do they manipulate & assess consciousness?
3) How do they manipulate & assess attention?
Brains@Bay Meetup: The Effect of Sensorimotor Learning on the Learned Represe... - Numenta
Most current deep neural networks learn from a static data set without active interaction with the world. We take a look at how learning through a closed loop between action and perception affects the representations learned in a DNN. We demonstrate how these representations are significantly different from those of DNNs that learn supervised or unsupervised from a static dataset without interaction. These representations are much sparser and encode meaningful content in an efficient way. Even an agent that learned without any external supervision, purely through curious interaction with the world, acquires encodings of the high-dimensional visual input that enable it to recognize objects using only a handful of labeled examples. Our results highlight the capabilities that emerge from letting DNNs learn more like biological brains, through sensorimotor interaction with the world.
For more:
Week 8: The neural basis of consciousness: consciousness vs. attention - Nao (Naotsugu) Tsuchiya
A 12-week lecture series on "the neural basis of consciousness" by Prof Nao Tsuchiya, given at third-year undergraduate level. No prerequisites.
Contents:
1) How can we define “attention”?
2) What are the paradigms to manipulate attention?
3) What are the neuronal mechanisms of attention?
4) How can we explain the relationship between attention and consciousness?
Can Marketers Get to Grips with the Human Condition? - Klaxon
On 20th October we explored how to employ neuroscience research techniques to drive marketing performance.
Our industry experts included:
Thom Noble, CEO, NeuroStrata
Mev Bertrand, Research Manager, Neuro-Insight
Will Nicholson, Managing Director, The Vision Network
The Building Blocks of QuestDB, a Time Series Database - javier ramirez
Talk delivered at Valencia Codes Meetup, June 2024.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Learn SQL from basic queries to advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with the NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting OpenMP PageRank: SHORT REPORT / NOTES - Subhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Unleashing the Power of Data: Choosing a Trusted Analytics Platform.pdf - Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
6. Attention, also referred to as enthrallment, is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information. It is a state of arousal. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneous objects or trains of thought. Focalization, the concentration of consciousness, is of its essence. Attention, or enthrallment, has also been described as the allocation of limited cognitive processing resources.
7. Attention, also referred to as enthrallment, is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information. It is a state of arousal. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneous objects or trains of thought. Focalization, the concentration of consciousness, is of its essence. Attention, or enthrallment, has also been described as the allocation of limited cognitive processing resources.
8. Attention, also referred to as enthrallment, is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information. It is a state of arousal. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneous objects or trains of thought. Focalization, the concentration of consciousness, is of its essence. Attention, or enthrallment, has also been described as the allocation of limited cognitive processing resources.
Recurrent Neural Network
[Figure: the example text above is processed token by token through recurrent hidden states]
Problem: non-parallel computation, no long-range dependencies
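To make the non-parallel computation concrete, here is a minimal NumPy sketch (toy sizes, not from the slides) of a vanilla RNN step loop: each hidden state depends on the previous one, so time steps cannot be processed in parallel, and information from early tokens must survive many updates to influence later ones.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_h, T = 8, 16, 50                       # assumed toy sizes
    W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
    W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
    x = rng.normal(size=(T, d_in))                 # a sequence of T token vectors

    h = np.zeros(d_h)
    for t in range(T):                             # strictly sequential: no parallelism over t
        h = np.tanh(x[t] @ W_xh + h @ W_hh)        # h_t depends on h_{t-1}
    print(h.shape)                                 # (16,) final hidden state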
9. Attention, also referred to as enthrallment, is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information. It is a state of arousal. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneous objects or trains of thought. Focalization, the concentration of consciousness, is of its essence. Attention, or enthrallment, has also been described as the allocation of limited cognitive processing resources.
Convolutional Neural Network
[Figure: a convolutional filter slides over the example text, so each output only sees a local window of words]
Problem: no long-range dependencies, computationally inefficient
10. Attention, also referred to as enthrallment, is the behavioral and cognitive process of selectively concentrating on a discrete aspect of information, whether deemed subjective or objective, while ignoring other perceivable information. It is a state of arousal. It is the taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneous objects or trains of thought. Focalization, the concentration of consciousness, is of its essence. Attention, or enthrallment, has also been described as the allocation of limited cognitive processing resources.
Attention mechanism
Parallel computation, long-range dependencies, explainable
12. Attention mechanism
Fig. from Vaswani et al. Attention is all you need. arXiv. 2017
1. Compute the similarity between Q and K.
2. Normalize so that excessively large values do not dominate.
13. Attention mechanism
Fig. from Vaswani et al. Attention is all you need. arXiv. 2017
1. Compute the similarity between Q and K.
2. Normalize so that excessively large values do not dominate.
3. Similarities → weights (summing to 1).
14. Attention mechanism
Fig. from Vaswani et al. Attention is all you need. arXiv. 2017
1. Compute the similarity between Q and K.
2. Normalize so that excessively large values do not dominate.
3. Similarities → weights (summing to 1).
4. Multiply the weights by V.
15. Attention mechanism
Fig. from Vaswani et al. Attention is all you need. arXiv. 2017
The information {K: V} will be related to some query Q. We can use this to compute the similarity between K and Q and apply it to V. That way, more of the information in V that is directly relevant to Q gets passed along.
1. Compute the similarity between Q and K.
2. Normalize so that excessively large values do not dominate.
3. Similarities → weights (summing to 1).
4. Multiply the weights by V.
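To make the four steps concrete, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and variable names are illustrative, not taken from the slides.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T                  # 1. similarity between Q and K
        scores = scores / np.sqrt(d_k)    # 2. scale so large values do not dominate
        weights = softmax(scores)         # 3. similarities -> weights summing to 1
        return weights @ V, weights       # 4. weighted sum over V

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 4))    # 3 queries of dim 4
    K = rng.normal(size=(5, 4))    # 5 keys of dim 4
    V = rng.normal(size=(5, 8))    # 5 values of dim 8
    out, w = scaled_dot_product_attention(Q, K, V)
    print(out.shape, w.sum(axis=-1))   # (3, 8); each row of weights sums to 1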
16. e.g. Attention mechanism with Seq2Seq
(Machine translation, Encoder-Decoder, Attention)
[Diagram: an encoder RNN feeding a decoder RNN]
The encoder propagates information depending on the hidden state at the previous step t and the input at the current step t.
The final encoder state is passed to the decoder.
The decoder propagates information depending only on the information from the previous step t.
17. e.g. Attention mechanism with Seq2Seq
(Machine translation, Encoder-Decoder, Attention)
[Diagram: an attention module connects every encoder state to the decoder and is combined (⊕) with the decoder input]
Attention
Long-range dependency
18. e.g. Attention mechanism with Seq2Seq
Fig. from Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. 2015
(Machine translation, Encoder-Decoder, Attention)
Attention
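A hedged NumPy sketch of the decoder-side attention step in this seq2seq setup, roughly in the spirit of Bahdanau et al. (2015): the current decoder state is scored against every encoder hidden state, the scores become weights, and their weighted sum is the context vector. The additive-scoring parameters (W_s, W_h, v) and all sizes are illustrative assumptions, not the paper's code.

    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    rng = np.random.default_rng(0)
    T, d = 6, 16
    enc_states = rng.normal(size=(T, d))   # encoder hidden states for T source tokens
    dec_state = rng.normal(size=(d,))      # current decoder hidden state

    W_s = rng.normal(scale=0.1, size=(d, d))
    W_h = rng.normal(scale=0.1, size=(d, d))
    v = rng.normal(scale=0.1, size=(d,))

    scores = np.tanh(dec_state @ W_s + enc_states @ W_h) @ v   # additive scores, shape (T,)
    alphas = softmax(scores)                                    # attention weights over the source
    context = alphas @ enc_states                               # context vector, shape (d,)
    print(alphas.round(2), context.shape)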
19. e.g. Style-token
Fig. from Wang et al. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. arXiv. 2018
(Text to speech, Encoder-Decoder, Style transfer, Attention)
[Diagram: a reference encoder (Encoder2) attends over a bank of randomly initialized global style tokens (GST); the resulting style embedding is combined (⊕) with the text encoder (Encoder1) output and fed to the decoder]
Demo: https://google.github.io/tacotron/publications/global_style_tokens/
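A minimal sketch of the global-style-token idea under assumed shapes (the paper uses multi-head attention; a single dot-product scoring is used here for brevity): the reference encoder output acts as the query over a small bank of randomly initialized, learnable style tokens, and the weighted sum is the style embedding that conditions the decoder.

    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    rng = np.random.default_rng(0)
    n_tokens, d = 10, 32
    style_tokens = rng.normal(size=(n_tokens, d))   # GST bank (random init, learned in training)
    ref_embedding = rng.normal(size=(d,))           # output of the reference encoder (Encoder2)

    weights = softmax(style_tokens @ ref_embedding / np.sqrt(d))   # attention over the token bank
    style_embedding = weights @ style_tokens                       # combined with the text encoding
    print(weights.round(2), style_embedding.shape)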
23. Self-attention
Fig. from Wang et al. Non-local neural networks. arXiv. 2017
1. Compute the similarity between pixels i and j.
24. Self-attention
Fig. from Wang et al. Non-local neural networks. arXiv. 2017
1. Compute the similarity between pixels i and j.
2. Multiply by the value of pixel j.
25. Self-attention
Fig. from Wang et al. Non-local neural networks. arXiv. 2017
1. Compute the similarity between pixels i and j.
2. Multiply by the value of pixel j.
3. Normalization term.
26. Self-attention
Fig. from Wang et al. Non-local neural networks. arXiv. 2017
The information at positions i and j will be related to each other. We compute a similarity for every pair of positions and use it as a weight. That way, relationships between all positions can be learned (long-range dependency!).
1. Compute the similarity between pixels i and j.
2. Multiply by the value of pixel j.
3. Normalization term.
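The same idea written as a minimal NumPy sketch of a non-local (self-attention) operation over a flattened feature map, in the spirit of Wang et al. (2017); the embedded-Gaussian scoring, the projection names, and the toy sizes are assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    H, W, C = 4, 4, 8
    x = rng.normal(size=(H * W, C))           # flattened H*W pixel features

    W_theta = rng.normal(scale=0.1, size=(C, C))
    W_phi = rng.normal(scale=0.1, size=(C, C))
    W_g = rng.normal(scale=0.1, size=(C, C))

    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g
    sim = theta @ phi.T                       # 1. similarity between pixels i and j
    attn = softmax(sim)                       # 3. normalization term (each row sums to 1)
    y = attn @ g                              # 2. weighted sum over the values of pixels j
    out = x + y                               # residual connection, as in non-local blocks
    print(out.shape)                          # (16, 8)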
27. e.g. Self-Attention GAN
(Image generation, GAN, Self-attention)
[Architecture diagram: Generator: latent z → transposed-convolution blocks with a self-attention layer → generated image x'. Discriminator: image x → convolution blocks with a self-attention layer → fully connected layer → real/fake probability.]
28. e.g. Self-Attention GAN
(Image generation, GAN, Self-attention)
Fig. from Zhang et al. Self-Attention Generative Adversarial Networks. arXiv. 2018
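A hedged sketch of how such a self-attention block can sit inside a SAGAN-style generator or discriminator: attention over all spatial positions of a conv feature map, added back through a learnable residual scale gamma (initialized to 0 in the paper, so the block starts as a pass-through). Layer sizes and names are illustrative only.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention_block(feat, W_f, W_g, W_h, gamma):
        # feat: (N, C) flattened conv feature map -> (N, C)
        f, g, h = feat @ W_f, feat @ W_g, feat @ W_h
        attn = softmax(f @ g.T)              # pairwise attention over all positions
        return gamma * (attn @ h) + feat     # gamma starts at 0: block begins as identity

    rng = np.random.default_rng(0)
    C = 8
    feat = rng.normal(size=(16 * 16, C))     # e.g. conv features of a 16x16 map
    W_f, W_g, W_h = (rng.normal(scale=0.1, size=(C, C)) for _ in range(3))
    out = self_attention_block(feat, W_f, W_g, W_h, gamma=0.0)
    print(np.allclose(out, feat))            # True: with gamma=0 the block is a pass-through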
31. Reference
- Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR. 2015
- Wang et al. Non-local Neural Networks. arXiv. 2017
- Vaswani et al. Attention Is All You Need. arXiv. 2017
- Wang et al. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. arXiv. 2018
- Zhang et al. Self-Attention Generative Adversarial Networks. arXiv. 2018
- Blog post explaining "Attention Is All You Need" (https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/)
- Video explaining "Attention Is All You Need" (https://www.youtube.com/watch?v=iDulhoQ2pro)