Olga Petrova gives an introduction to transformers for natural language processing (NLP). She begins with an overview of representing words using tokenization, one-hot encodings, and word embeddings. Recurrent neural networks (RNNs) are discussed because they are important for modeling sequential data such as text, but they struggle with long-term dependencies. Attention mechanisms were developed to address this by allowing the model to focus on the relevant parts of the input. Transformers use self-attention and have achieved state-of-the-art results on many NLP tasks. Bidirectional Encoder Representations from Transformers (BERT) provides contextualized word embeddings trained on large corpora.
An introduction to the Transformers architecture and BERT, by Suman Debnath
The transformer is one of the most popular state-of-the-art (SOTA) deep learning architectures, used mostly for natural language processing (NLP) tasks. Since its advent, the transformer has replaced RNNs and LSTMs for many tasks. It also marked a major breakthrough in NLP and paved the way for revolutionary new architectures such as BERT.
The transformer is an established architecture in natural language processing that is built around self-attention within a deep learning framework.
This presentation was delivered under the mentorship of Mr. Mukunthan Tharmakulasingam (University of Surrey, UK), as a part of the ScholarX program from Sustainable Education Foundation.
This is material I prepared for our lab study group on the "Transformer", which underlies recent NLP x Deep Learning research. I have tried to be accurate in citing the reference material, but please point out any errors.
Fine-tune and deploy Hugging Face NLP models, by OVHcloud
Are you currently managing AI projects that require a lot of GPU power?
Are you tired of managing the complexity of your infrastructures, GPU instances and your Kubeflow yourself?
Need flexibility for your AI platform or SaaS solution?
OVHcloud innovates in AI by offering simple and turnkey solutions to train your models and put them into production.
A brief introduction to the attention mechanism and its application in neural machine translation, especially in the transformer, where attention was used to remove RNNs from NMT entirely.
Introduction to seq2seq (sequence-to-sequence) models and RNNs, by Hye-min Ahn
These are my slides introducing the sequence-to-sequence model and the Recurrent Neural Network (RNN) to my laboratory colleagues.
Hyemin Ahn, @CPSLAB, Seoul National University (SNU)
These slides are an introduction to the domain of NLP and the basic NLP pipeline commonly used in the field of Computational Linguistics.
Thomas Wolf, "An Introduction to Transfer Learning and Hugging Face", at Fwdays
In this talk I'll start by introducing the recent breakthroughs in NLP that resulted from the combination of Transfer Learning schemes and Transformer architectures. The second part of the talk will be dedicated to an introduction of the open-source tools released by Hugging Face, in particular our transformers, tokenizers, and NLP libraries as well as our distilled and pruned models.
BERT: Bidirectional Encoder Representations from Transformers.
BERT is a pretrained model by Google for state-of-the-art NLP tasks.
BERT can take into account both the syntactic and semantic meaning of text.
Continuous representations of words and documents, now commonly referred to as word embeddings, have recently driven large advances in many natural language processing tasks.
In this presentation we will provide an introduction to the most common methods of learning these representations, as well as earlier methods used before the recent advances in deep learning, such as dimensionality reduction on the word co-occurrence matrix.
Moreover, we will present the continuous bag-of-words (CBOW) model, one of the most successful models for word embeddings and one of the core models in word2vec, and briefly glance at other models that build representations for other tasks, such as knowledge base embeddings.
Finally, we will motivate the potential of using such embeddings for tasks that could be of importance for the group, such as semantic similarity, document clustering, and retrieval.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Over the last two years, the field of Natural Language Processing (NLP) has witnessed the emergence of transfer learning methods and architectures which significantly improved upon the state of the art on pretty much every NLP task.
The wide availability and ease of integration of these transfer learning models are strong indicators that these methods will become a common tool in the NLP landscape as well as a major research direction.
In this talk, I'll present a quick overview of modern transfer learning methods in NLP and review examples and case studies on how these models can be integrated and adapted in downstream NLP tasks, focusing on open-source solutions.
Website: https://fwdays.com/event/data-science-fwdays-2019/review/transfer-learning-in-nlp
Attention Mechanism in Language Understanding and its Applications, by Artifacia
This is the presentation from our AI Meet March 2017 on Attention Mechanism in Language Understanding and its Applications.
You can join Artifacia AI Meet Bangalore Group: https://www.meetup.com/Artifacia-AI-Meet/
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra..., by Edureka!
** NLP Using Python: - https://www.edureka.co/python-natural-language-processing-course **
This Edureka PPT will provide you with comprehensive and detailed knowledge of Natural Language Processing, popularly known as NLP. You will also learn about the different steps involved in processing human language, such as tokenization, stemming, and lemmatization, along with a demo of each topic.
The following topics are covered in this PPT:
1. The Evolution of Human Language
2. What is Text Mining?
3. What is Natural Language Processing?
4. Applications of NLP
5. NLP Components and Demo
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
BERT: Bidirectional Encoder Representations from Transformers, by Liangqun Lu
BERT was developed by Google AI Language and came out in October 2018. It has achieved the best performance on many NLP tasks, so if you are interested in NLP, studying BERT is a good way to go.
"Attention Is All You Need" Grazie a queste semplici parole, nel 2017 il Deep Learning ha subito un profondo cambiamento. I Transformers, inizialmente introdotti nel campo del Natural Language Processing, si sono recentemente dimostrati estremamente efficaci anche al di fuori di questo settore, ottenendo un enorme - e forse inaspettato - successo nel campo della Computer Vision. I Vision Transformers e moltissime delle sue varianti stanno ridefinendo oggi lo stato dell'arte su molti task di visione artificiale, dalla classificazione di immagini fino ai sistemi di visione per la guida autonoma. Ma cosa sono i Transformers? In che cosa consiste il meccanismo della self-attention che è alla base del loro funzionamento? Quali sono i suoi limiti? Saranno in grado di rimpiazzare le famose reti convoluzionali che hanno, a loro tempo, rivoluzionato la Computer Vision? In questo talk cercheremo di rispondere a tutte queste domande, offrendo un'ampia panoramica sulle idee fondanti, sulle architetture Transformer più utilizzate, e sulle applicazioni più promettenti.
Introductory seminar on NLP for CS sophomores. Presented to Texas A&M's Fall 2022 CSCE181 class. Slides are a bit redundant due to compatibility issues :\
Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.
Beyond the Symbols: A 30-minute Overview of NLP, by MENGSAYLOEM1
This presentation delves into the world of Natural Language Processing (NLP), exploring its goal to make human language understandable to machines. The complexities of language, such as ambiguity and complex structures, are highlighted as major challenges. The talk underscores the evolution of NLP through deep learning methodologies, leading to a new era defined by large-scale language models. However, obstacles like low-resource languages and ethical issues including bias and hallucination are acknowledged as enduring challenges in the field. Overall, the presentation provides a condensed, yet comprehensive view of NLP's accomplishments and ongoing hurdles.
Recurrent Neural Networks hold great promise as general sequence learning algorithms. As such, they are a very promising tool for text analysis. However, outside of very specific use cases such as handwriting recognition and, recently, machine translation, they have not seen widespread use. Why has this been the case?
In this presentation, we will first introduce RNNs as a concept. Then we will sketch how to implement them and cover the tricks necessary to make them work well. With the basics covered, we will investigate using RNNs as general text classification and regression models, examining where they succeed and where they fail compared to more traditional text analysis models. A straightforward open-source Python and Theano library for training RNNs with a scikit-learn-style interface will be introduced, and we'll see how to use it through a tutorial on a real-world text dataset.
DotNet 2019 | Pablo Doval - Recurrent Neural Networks with TF2.0, by Plain Concepts
In this session we will explore Recurrent Neural Networks (RNNs), a type of neural network specially designed to process sequences, and their applications to time series and text processing (NLP). To make the session even more interesting, all the code will be developed using the latest version of TensorFlow 2.0, using the implementation of the models to discuss the major changes with respect to the 1.x versions of the deep learning framework, and it will leverage MLflow within Azure Databricks as a development platform and model-serving solution.
Recipe2Vec: Or how does my robot know what's tasty, by PyData
By Meghan Heintz
PyData New York City 2017
Your user knows they want a healthyish but tasty pasta for dinner but aren't quite sure exactly which recipe to choose. How can you help narrow their search and show them closely related recipes to give them enough options without making their search exhausting? This talk will show you BuzzFeed/Tasty tech's solution to creating a consistent method for finding similar Tasty recipes using word2vec.
2-5. Preliminaries
● Who I am:
○ Product Manager for AI PaaS at Scaleway (Scaleway: European cloud provider, originating from France)
○ Machine Learning engineer at Scaleway
○ Quantum physicist at École Normale Supérieure, the Max Planck Institute, Johns Hopkins University
www.olgapaints.net
6-7. Outline
1. NLP 101: how do we represent words in machine learning
🤖 Tokenization 🤖 Word embeddings
2. Order matters: Recurrent Neural Networks
🤖 RNNs 🤖 Seq2Seq models 🤖 What is wrong with RNNs
3. Attention is all you need!
🤖 What is Attention 🤖 Attention, the linear algebra perspective
4. The Transformer architecture
🤖 Mostly Attention
5. BERT: contextualized word embeddings
10-12. Representing words
I play with my dog.
● Machine learning: inputs are vectors / matrices, e.g. tabular data:
ID      Age  Gender  Weight  Diagnosis
264264  27   0       72      1
908696  61   1       80      0
● How do we convert "I play with my dog" into numerical values?
NLP 101 RNNs Attention Transformers BERT
13-14. Representing words
I play with my dog.
● Tokens: I, play, with, my, dog, .
● One-hot encoding: each token becomes a vector with a single 1 at that token's index and 0s elsewhere
● Dimensionality = vocabulary size (here 6)
e.g. I = [1 0 0 0 0 0], play = [0 1 0 0 0 0], ..., . = [0 0 0 0 0 1]
NLP 101 RNNs Attention Transformers BERT
15-16. Representing words
I play with my dog. / I play with my cat.
● One hot encoding
● Dimensionality = vocabulary size
● A new word ("cat") extends the vocabulary to 7 tokens, so every one-hot vector grows to 7 dimensions, e.g. cat = [0 0 0 0 0 0 1]
NLP 101 RNNs Attention Transformers BERT
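To make this concrete, here is a minimal Python sketch (not from the slides; the vocabulary and index order are illustrative) that builds a vocabulary from the two example sentences and one-hot encodes each token; note how adding "cat" grows the dimensionality of every vector.

```python
# Minimal one-hot encoding sketch (illustrative, not from the slides).
import numpy as np

sentences = ["I play with my dog .", "I play with my cat ."]

# Build the vocabulary: one index per unique token, in order of first appearance.
vocab = {}
for sentence in sentences:
    for token in sentence.split():
        if token not in vocab:
            vocab[token] = len(vocab)

def one_hot(token: str) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the token's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[token]] = 1.0
    return vec

print(vocab)           # {'I': 0, 'play': 1, 'with': 2, 'my': 3, 'dog': 4, '.': 5, 'cat': 6}
print(one_hot("dog"))  # [0. 0. 0. 0. 1. 0. 0.]  -- dimensionality = vocabulary size (7)
print(one_hot("cat"))  # [0. 0. 0. 0. 0. 0. 1.]
```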
17. Tokenization
I play / played / was playing with my dog.
● Word = token: play, played, playing
● Subword = token: play, #ed, #ing
● E.g.: played = play + #ed
● In practice, subword tokenization is used most often
● For simplicity, we’ll go with word = token for the rest of the talk
NLP 101 RNNs Attention Transformers BERT
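As an illustration of subword tokenization in practice, here is a short sketch assuming the Hugging Face transformers library and the bert-base-uncased vocabulary (not part of the original slides); BERT's WordPiece tokenizer marks continuation subwords with "##" rather than the "#" used above.

```python
# Subword tokenization sketch (assumes the Hugging Face `transformers` package is installed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Frequent words typically stay whole; rarer word forms may be split into subword pieces
# (continuation pieces are prefixed with "##" in WordPiece).
print(tokenizer.tokenize("I play with my dog."))
print(tokenizer.tokenize("I was playing with my puppies."))
```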
18-21. Word embeddings
One-hot encodings (N = vocabulary size): cat, dog, kitten, and puppy each get an N-dimensional vector with a single 1.
● dog : puppy = cat : kitten. How can we record this?
● Goal: a lower-dimensional representation of a word (M << N)
● Words are forced to “group together” in a meaningful way
M-dimensional embeddings (M = 3 here):
cat = [1 0 1], dog = [0 1 1], kitten = [1 0 0], puppy = [0 1 0]
NLP 101 RNNs Attention Transformers BERT
22-23. Word embeddings
What do you mean, “meaningful way”?
cat = [1 0 1], dog = [0 1 1], kitten = [1 0 0], puppy = [0 1 0]
● Dimension 1: 1 for feline
● Dimension 2: 1 for canine
● Dimension 3: 1 for adult / 0 for baby
● dog - puppy = cat - kitten
NLP 101 RNNs Attention Transformers BERT
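A quick check of the analogy arithmetic with the toy vectors above (illustrative hand-made values, not learned embeddings):

```python
# Checking the toy analogy dog - puppy = cat - kitten with the example vectors.
import numpy as np

emb = {
    "cat":    np.array([1, 0, 1]),   # [feline, canine, adult]
    "dog":    np.array([0, 1, 1]),
    "kitten": np.array([1, 0, 0]),
    "puppy":  np.array([0, 1, 0]),
}

print(emb["dog"] - emb["puppy"])                   # [0 0 1] -- the "adult" direction
print(np.array_equal(emb["dog"] - emb["puppy"],
                     emb["cat"] - emb["kitten"]))  # True
```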
24-28. Word embeddings: how do we arrive at this representation?
● Pass the one-hot vector through a network with a bottleneck layer
= multiply the N-dim vector by an NxM matrix to get an M-dim vector
● Some information gets lost in the process
● Embedding matrix: learned through machine learning!
(Figure: one-hot vector x embedding matrix W = embedded vector)
In practice:
● Embedding can be part of your model that you train for a ML task
● Or: use pre-trained embeddings (e.g. Word2Vec, GloVe, etc)
NLP 101 RNNs Attention Transformers BERT
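A small NumPy sketch (illustrative, with a made-up 7-word vocabulary and M = 3) showing that multiplying a one-hot vector by an N x M embedding matrix simply selects one row of that matrix:

```python
# Embedding lookup as a matrix multiplication (illustrative sketch).
import numpy as np

N, M = 7, 3                      # N = vocabulary size, M = embedding dimension (M << N in practice)
rng = np.random.default_rng(0)
W = rng.normal(size=(N, M))      # embedding matrix, normally learned during training

one_hot_dog = np.zeros(N)
one_hot_dog[4] = 1.0             # suppose "dog" has index 4

embedded = one_hot_dog @ W       # (1 x N) times (N x M) -> an M-dimensional vector
print(embedded)
print(np.allclose(embedded, W[4]))  # True: the product just picks out row 4 of W
```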
31-34. Sequence to Sequence models
● Not all seq2seq models are NLP
● Not all NLP models are seq2seq!
● But: they are common enough
● A seq2seq ML model maps Sequence 1 to Sequence 2
● Language modeling, question answering, machine translation, etc
● Machine Translation example: “I am a student” → “je suis étudiant”
Takeaways:
1. Multiple elements in a sequence
2. Order of the elements matters
NLP 101 RNNs Attention Transformers BERT
40-57. RNNs: the seq2seq example
(Series of figures: an encoder-decoder RNN processes the input sequence and generates the output sequence one element at a time.)
● Inputs: word embeddings
● Outputs: one-hot
● Context vector: contains the meaning of the whole input sequence
● Temporal direction (remember the Wikipedia definition?)
NLP 101 RNNs Attention Transformers BERT
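To see where the context vector fits, here is a minimal PyTorch sketch (an assumption of this write-up, not code from the talk) of an encoder-decoder in which the encoder's final hidden state is the single fixed-size context vector handed to the decoder:

```python
# Minimal encoder-decoder RNN sketch with a single fixed-size context vector (illustrative).
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)   # input word embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)       # scores over the target vocabulary

    def forward(self, src_ids, tgt_ids):
        # Encoder: read the whole input; keep only the last hidden state.
        _, context = self.encoder(self.src_emb(src_ids))  # context: (1, batch, hidden_dim)
        # Decoder: generate outputs conditioned solely on that fixed-size context vector.
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_states)                       # (batch, tgt_len, tgt_vocab)

model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 5))   # toy "I am a student"-style inputs, batch of 2
tgt = torch.randint(0, 1200, (2, 6))
print(model(src, tgt).shape)           # torch.Size([2, 6, 1200])
```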
59-66. Does this really work?
Yes! (For the most part.) Google Translate has been using these since ~2016.
What is the catch?
● Catch # 1: Recurrent = Non-parallelizable 🐌
● Catch # 2: Relying on a fixed-length context vector to capture the full input sequence. What can go wrong?
“I am a student.” ✅
“It was a wrong number that started it, the telephone ringing three times in the dead of night, and the voice on the other end asking for someone he was not.” 😱
NLP 101 RNNs Attention Transformers BERT
71-72. The Attention Mechanism
Recurrent NN from the previous section:
● Send all of the hidden states to the Decoder, not just the last one
● The Decoder determines which input elements get attended to via a softmax layer
NLP 101 RNNs Attention Transformers BERT
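A small NumPy sketch of this idea (dot-product attention over the encoder hidden states; the toy dimensions are made up for illustration, not taken from the slides):

```python
# Dot-product attention over encoder hidden states (illustrative sketch).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden_dim, src_len = 4, 6
rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(src_len, hidden_dim))  # all encoder hidden states, not just the last
decoder_state = rng.normal(size=hidden_dim)              # current decoder hidden state

scores = encoder_states @ decoder_state    # one relevance score per input position
weights = softmax(scores)                  # softmax turns scores into attention weights (sum to 1)
context = weights @ encoder_states         # weighted mix of hidden states, recomputed at every step

print(weights.round(2), context.shape)     # attention weights and a (4,)-dim context vector
```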
73. Further reading
jalammar.github.io/illustrated-transformer/
My blog post that this talk is based on + the linear algebra perspective on Attention:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
NLP 101 RNNs Attention Transformers BERT
82-83. Attention all around
Word embedding + position encoding
● All of the sequence elements (w1, w2, w3) are being processed in parallel
● Each element’s encoding gets information (through hidden states) from the other elements
● Bidirectional by design 😍
NLP 101 RNNs Attention Transformers BERT
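As a concrete illustration of the self-attention computation at the heart of the Transformer (scaled dot-product attention from "Attention Is All You Need"; the tiny dimensions below are made up for the example):

```python
# Scaled dot-product self-attention for a 3-token sequence (illustrative sketch).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 3, 8, 4          # 3 tokens (w1, w2, w3), toy dimensions
rng = np.random.default_rng(2)
X = rng.normal(size=(seq_len, d_model))  # word embeddings + position encodings

W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v      # queries, keys, values for every token, computed in parallel

attn = softmax(Q @ K.T / np.sqrt(d_k))   # (3, 3): how much each token attends to every other token
out = attn @ V                           # each output row mixes information from the whole sequence

print(attn.round(2))
print(out.shape)                         # (3, 4)
```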
84-85. Why Self-Attention?
● “The animal did not cross the road because it was too tired.”
vs.
● “The animal did not cross the road because it was too wide.”
A uni-directional RNN would have missed this! Bi-directional is nice.
NLP 101 RNNs Attention Transformers BERT
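One way to poke at this yourself is to ask a pre-trained BERT for its attention weights and look at what the token "it" attends to; this is a sketch assuming the Hugging Face transformers library, and the exact attention pattern will depend on the model:

```python
# Inspecting self-attention weights for the "it" disambiguation example (illustrative sketch).
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

sentence = "The animal did not cross the road because it was too tired."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
it_pos = tokens.index("it")
# outputs.attentions: one (batch, heads, seq, seq) tensor per layer; average the last layer's heads.
weights = outputs.attentions[-1][0].mean(dim=0)[it_pos]
for tok, w in sorted(zip(tokens, weights.tolist()), key=lambda p: -p[1])[:5]:
    print(f"{tok:>10s}  {w:.3f}")   # the tokens that "it" attends to most, under this model
```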
88-92. (Contextual) word embeddings
● Dense (i.e. not sparse) vectors capturing the meaning of tokens/words
● Meaning often depends on the context (surrounding words)
● Self-Attention in Transformers processes context when encoding a token
● Can we just take the Transformer Encoder and use it to embed words?
NLP 101 RNNs Attention Transformers BERT
94. (Contextual) word embeddings
BERT: Bidirectional Encoder Representations from Transformers
Instead of Word2Vec, use pre-trained BERT for your NLP tasks ;)
NLP 101 RNNs Attention Transformers BERT
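For instance, here is a minimal sketch (assuming the Hugging Face transformers library; not code from the talk) that pulls contextual embeddings out of a pre-trained BERT; the same word "play" gets a different vector depending on its sentence:

```python
# Contextual word embeddings from pre-trained BERT (illustrative sketch).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    """Return one contextual vector per token (last hidden layer of BERT)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0]   # (seq_len, 768) for bert-base

e1 = embed("I play with my dog.")
e2 = embed("I saw a play at the theater.")
# Unlike Word2Vec, the embedding of "play" now depends on its surrounding words.
print(e1.shape, e2.shape)
```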
95-96. What was BERT pre-trained on?
Task # 1: Masked Language Modeling
● Take a text corpus, randomly mask 15% of the words
● Put a softmax layer (dim = vocab size) on top of the BERT encodings of the words that are present to predict which word is missing
Task # 2: Next Sentence Prediction
● Input sequence: two sentences separated by a special token
● Binary classification: are these sentences successive or not?
NLP 101 RNNs Attention Transformers BERT
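Masked language modeling is easy to try with a pre-trained model; a short sketch assuming the Hugging Face transformers library (the exact predictions will vary by model):

```python
# Masked Language Modeling with a pre-trained BERT (illustrative sketch).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT was trained to recover masked words from their (bidirectional) context.
for prediction in fill_mask("I play with my [MASK] every morning."):
    print(f"{prediction['token_str']:>10s}  score={prediction['score']:.3f}")
```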
97. Further reading and coding
My blog posts:
towardsdatascience.com/natural-language-processing-the-age-of-transformers-a36c0265937d
blog.scaleway.com/understanding-text-with-bert/
The HuggingFace 🤗 Transformers library: huggingface.co/
NLP 101 RNNs Attention Transformers BERT