Word embedding, Vector space model, Language modelling, Neural language model, Word2Vec, GloVe, FastText, ELMo, BERT, DistilBERT, RoBERTa, SBERT, Transformer, Attention
A Simple Introduction to Word Embeddings - Bhaskar Mitra
In information retrieval there is a long history of learning vector representations for words. In recent times, neural word embeddings have gained significant popularity for many natural language processing tasks, such as word analogy and machine translation. The goal of this talk is to introduce basic intuitions behind these simple but elegant models of text representation. We will start our discussion with classic vector space models and then make our way to recently proposed neural word embeddings. We will see how these models can be useful for analogical reasoning as well as applied to many information retrieval tasks.
An Introduction to the Transformer Architecture and BERT - Suman Debnath
The transformer is one of the most popular state-of-the-art (SOTA) deep learning architectures, mostly used for natural language processing (NLP) tasks. Ever since its advent, it has replaced RNNs and LSTMs for various tasks. The transformer also created a major breakthrough in the field of NLP and paved the way for new revolutionary architectures such as BERT.
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra... - Edureka!
** NLP Using Python: - https://www.edureka.co/python-natural-language-processing-course **
This Edureka PPT provides a comprehensive and detailed overview of Natural Language Processing, popularly known as NLP. It also covers the different steps involved in processing human language, such as tokenization, stemming and lemmatization, along with a demo of each topic.
The following topics covered in this PPT:
1. The Evolution of Human Language
2. What is Text Mining?
3. What is Natural Language Processing?
4. Applications of NLP
5. NLP Components and Demo
GPT-2: Language Models are Unsupervised Multitask Learners - Young Seok Kim
Review of the paper
Language Models are Unsupervised Multitask Learners
(GPT-2)
by Alec Radford et al.
Paper link: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
YouTube presentation: https://youtu.be/f5zULULWUwM
(Slides are written in English, but the presentation is done in Korean)
The Text Classification slides contain research results about possible natural language processing algorithms. Specifically, they contain a brief overview of the natural language processing steps, the common algorithms used to transform words into meaningful vectors/data, and the algorithms used to learn from and classify the data.
To learn more about RAX Automation Suite, visit: www.raxsuite.com
BERT: Bidirectional Encoder Representations from Transformers.
BERT is a model pretrained by Google for state-of-the-art NLP tasks.
BERT has the ability to take into account the syntactic and semantic meaning of text.
Continuous representations of words and documents, recently referred to as word embeddings, have demonstrated large advancements in many natural language processing tasks.
In this presentation we provide an introduction to the most common methods of learning these representations, as well as earlier methods of building them before the recent advances in deep learning, such as dimensionality reduction on the word co-occurrence matrix.
Moreover, we present the continuous bag of words model (CBOW), one of the most successful models for word embeddings and one of the core models in word2vec, and briefly glance at other models for building representations for other tasks, such as knowledge base embeddings.
Finally, we motivate the potential of using such embeddings for many tasks that could be of importance for the group, such as semantic similarity, document clustering and retrieval.
Introduction to seq2seq (sequence to sequence) and RNN - Hye-min Ahn
These are my slides for introducing the sequence-to-sequence model and the Recurrent Neural Network (RNN) to my laboratory colleagues.
Hyemin Ahn, @CPSLAB, Seoul National University (SNU)
Basics covered regarding Natural Language Processing, how ANNs evolved into RNNs, architectures of vanilla RNN, LSTM and GRU, and a few preprocessing techniques.
Brief introduction to the attention mechanism and its application in neural machine translation, especially in the transformer, where attention was used to remove RNNs completely from NMT.
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Our project is about guessing the correct missing word in a given sentence. To find or guess the missing word we have two main methods: one is statistical language modelling, while the other is neural language models. Statistical language modelling depends on the frequency of relations between words, and here we use a Markov chain. Neural language models use artificial neural networks and deep learning; here we use BERT, the state of the art in language modelling provided by Google.
DELAB - sequence generation seminar
Title
Open vocabulary problem
Table of contents
1. Open vocabulary problem
1-1. Open vocabulary problem
1-2. Ignore rare words
1-3. Approximative Softmax
1-4. Back-off Models
1-5. Character-level model
2. Solution 1: Byte Pair Encoding (BPE)
3. Solution 2: WordPiece Model (WPM)
Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text - kevig
The article represents the Sentiment Analysis (SA) and Tense Classification using Skip gram model for the word to vector encoding on Nepali language. The experiment on SA for positive-negative classification is carried out in two ways. In the first experiment the vector representation of each sentence is generated by using Skip-gram model followed by the Multi-Layer Perceptron (MLP) classification and it is observed that the F1 score of 0.6486 is achieved for positive-negative classification with overall accuracy of 68%. Whereas in the second experiment the verb chunks are extracted using Nepali parser and carried out the similar experiment on the verb chunks. F1 scores of 0.6779 is observed for positive -negative classification with overall accuracy of 85%. Hence, Chunker based sentiment analysis is proven to be better than sentiment analysis using sentences. This paper also proposes using a skip-gram model to identify the tenses of Nepali sentences and verbs. In the third experiment, the vector representation of each sentence is generated by using Skip-gram model followed by the Multi-Layer Perceptron (MLP)classification and it is observed that verb chunks had very low overall accuracy of 53%. In the fourth experiment, conducted for Tense Classification using Sentences resulted in improved efficiency with overall accuracy of 89%. Past tenses were identified and classified more accurately than other tenses. Hence, sentence based tense classification is proven to be better than verb Chunker based sentiment analysis.
Deep neural methods have recently demonstrated significant performance improvements in several IR tasks. In this lecture, we will present a brief overview of deep models for ranking and retrieval.
This is a follow-up lecture to "Neural Learning to Rank" (https://www.slideshare.net/BhaskarMitra3/neural-learning-to-rank-231759858)
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES - kevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.
This presentation goes into the details of word embeddings, their applications, learning word embeddings through a shallow neural network, and the Continuous Bag of Words model.
Deep Learning and Modern Natural Language Processing (AnacondaCon 2019) - Zachary S. Brown
This talk covers the fundamental building blocks of neural network architectures and how they're used to tackle problems in modern natural language processing. Topics include an overview of language vector representations, text classification, named entity recognition, and sequence-to-sequence modeling approaches. Dr. Brown emphasizes the shape of these types of problems from the perspective of deep-learning architectures, which will help attendees successfully identify the most applicable neural network techniques for new problems they encounter.
Deep dive into the world of word vectors. We will cover the bigram model, skip-gram, CBOW, and GloVe. Starting from the simplest models, we will journey through key results and ideas in this area.
https://www.meetup.com/Deep-Learning-Bangalore/events/239996690/
2. What is Word Embedding?
• Natural language processing (NLP) models do not work with plain text, so a numerical representation is required.
• Word embedding is a class of techniques where a word is represented as a real-valued vector.
• It is a representation of a word in a continuous vector space.
• It is a dense representation in a vector space.
• It can be represented in a smaller dimension compared to sparse representations like one-hot encoding.
• Most word embedding methods are based on the "distributional hypothesis" by Zellig Harris.
3. What is word embedding? continued
• The distributional hypothesis states that words that occur in the same contexts tend to have similar meanings. (Harris, 1954)
• Word embeddings are designed to capture similarity between representations: meaning, morphology, context, etc.
• The captured relationships help us work on downstream NLP tasks like chat-bots, text summarization, information retrieval, etc.
• Embeddings are generated via co-occurrence matrices, dimensionality reduction and neural networks.
• They can be broadly categorized into two families: frequency-based embeddings and prediction-based embeddings.
• The earliest work to give words a vector representation was the vector space model used in information retrieval.
4. Vector space model
• A document is represented in a vector space.
• The dimensionality of the vector space equals the number of unique words in the corpus.
• Hypothetical corpus with three words, each represented as a dimension; three documents are projected into the vector space according to their term frequencies:

            Term 1   Term 2   Term 3
    Doc 1     0        5        5
    Doc 2     2        0        1
    Doc 3     3        3        0
5. Vector space model continued
• Each document gets a numerical vector representation in a vector space whose dimensions are words.
• E.g.
  • Doc 1 -> [0, 5, 5]
  • Doc 2 -> [2, 0, 1]
• This representation is sparse in nature, because in real-life scenarios the dimensionality of a corpus shoots up to millions.
• It is based on term frequency.
• TF-IDF normalization is applied to reduce the weight of frequent words like "the", "are", etc.
• One-hot encoding is a similar technique to represent a sentence/document in vector space.
• This representation gathers limited information and fails to capture the context of a word.
6. Co-occurrence matrix
• It is applied to capture the neighbouring words that appear with the word under consideration. A context window is used to calculate co-occurrence.
• E.g.:
  • India won the match. I like the match.
• Co-occurrence matrix for the above two sentences with a context window of 1:

             India  won  the  match   I   like
    India      1     1    0     0     0    0
    won        1     1    1     0     0    0
    the        0     1    1     1     0    1
    match      0     0    1     1     0    0
    I          0     0    0     0     1    1
    like       0     0    1     0     1    1
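Below is a small sketch (an assumption-laden illustration, not the deck's code) of computing such counts with whitespace tokenization and a symmetric window of 1. Note that the slide's matrix appears to count each word with itself (the diagonal of 1s) and to cap entries at 1, whereas this version reports raw counts with a zero diagonal.

```python
from collections import defaultdict

sentences = ["India won the match", "I like the match"]
window = 1

counts = defaultdict(int)
for sent in sentences:
    tokens = sent.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1   # co-occurrence within the window

vocab = ["India", "won", "the", "match", "I", "like"]
for w in vocab:
    print(f"{w:>6}", [counts[(w, c)] for c in vocab])
```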
7. Co-occurrence matrix continued
• Representations like one-hot encoding, count-based methods and co-occurrence-matrix-based methods are very sparse in nature.
• Context was either limited or absent altogether.
• A single representation is used for a word in every context.
• Relations between two words, such as semantic reasoning, are not possible with this representation.
• Context is limited but predetermined.
• Long-term dependencies are not captured.
8. Prediction based word embeddings
• A method to learn a dense representation of a word from a very high-dimensional representation.
• It is a modular setup, where a sparse vector is fed in to generate a dense representation.
• Diagram: a one-hot encoded word, e.g. India = [0, 1, 0, ..., 0], is fed into a word embedding model, which outputs a dense vector V(India) = [0.1, 2.3, -2.1, ..., 0.1].
9. Language modelling
• Word embedding models are very closely related to language modelling.
• Language modelling tries to learn a probability distribution over the words in the vocabulary V.
• The prime task of a language model is to calculate the probability of a word $w_i$ given the previous $(n-1)$ words, mathematically $p(w_i \mid w_{i-1}, \dots, w_{i-n+1})$.
• Probabilities over n-grams are calculated from the frequencies of their constituent n-grams.
• In a neural network we achieve the same using a softmax layer.
• We compute a score for $w_i$ and normalize it by the sum of the scores over all the words:
  $p(w_i \mid w_{i-1}, \dots, w_{i-n+1}) = \frac{\exp(h^\top v'_{w_i})}{\sum_{w_j \in V} \exp(h^\top v'_{w_j})}$
• Here $h$ is the representation from the hidden layer and $v'_{w_i}$ is the output embedding of word $w_i$.
• The inner product $h^\top v'_{w_i}$ generates the unnormalized log probability of word $w_i$.
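A tiny numerical sketch of that softmax; the vocabulary size, hidden size and random vectors below are toy assumptions, not values from the slides.

```python
import numpy as np

V, D = 10, 4                        # toy vocabulary size and hidden dimension
rng = np.random.default_rng(0)
h = rng.normal(size=D)              # hidden-layer representation h
out_emb = rng.normal(size=(V, D))   # one output embedding v'_w per vocabulary word

logits = out_emb @ h                # h . v'_w: unnormalized log-probabilities
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(3), probs.sum())  # a proper distribution over the V words
```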
10. Classical Neural language model
• It was proposed by Bengio et al., 2003.
• It consists of a one-layer feed-forward neural network that predicts the next word in a sequence.
• The model tries to maximize the probability as computed by the softmax.
• Bengio et al. introduced three concepts:
  • Embedding layer: a layer that generates word embeddings by multiplying an index vector with a word embedding matrix.
11. Classical Neural language model continued
• Intermediate layers: one or more layers that produce an intermediate representation of the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of the word embeddings of the $n$ previous words.
• Softmax layer: the final layer that produces a probability distribution over the words in V.
• The intermediate layer can be replaced with an LSTM.
• The network has a computational bottleneck in the softmax layer, where the probability over the whole vocabulary needs to be computed.
• Neural word embedding models made significant progress with the Word2vec model proposed by Mikolov et al. in 2013.
12. Word2Vec
• It was proposed by Mikolov et al. in 2013.
• It is a two-layer shallow neural network trained to learn contextual relationships.
• It places contextually similar words near each other.
• It is a co-occurrence based model.
• Two variants of the model were proposed:
  • Continuous bag of words model (CBOW)
    • Given the context words, predict the center word.
    • The order of context words is not considered, so this representation is similar to BOW.
  • Skip-gram model
13. What does context mean?
• Context is co-occurrence of words: a sliding window around the word under consideration.
• Illustration: the sentence "India is now inching towards a self reliant state" is shown once per target word; with window size 2, the yellow patch marks the word under consideration and the orange boxes mark its context window.
14. CBOW continued
• Goal: predict the center word, given the context words.
• Diagram: one-hot vectors of the context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$ are each multiplied by a projection matrix P of shape V x D (to be learned); the projected vectors are averaged into a context vector C; C is multiplied by an output projection matrix M of dimension D x V and passed through a softmax layer; the output is compared against the one-hot vector of $w_t$ with a cross-entropy loss.
15. CBOW continued
• One-hot encodings of the context words $w_{t-2}, \dots, w_{t+2}$ are the input to the model.
• A projection matrix of shape V x D, where V is the total number of unique words in the corpus and D is the dimension of the dense representation, projects each one-hot encoded vector into a D-dimensional vector.
• The averaged context vector is projected back into V-dimensional space; a softmax layer converts this into probabilities for $w_t$.
• The model is trained using the cross-entropy loss between the softmax output and the one-hot encoded representation of $w_t$.
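As a rough illustration (assuming gensim 4.x; the toy sentences and hyperparameters below are arbitrary choices, not the deck's), CBOW training can be run as follows, with sg=0 selecting the CBOW variant described above:

```python
from gensim.models import Word2Vec

sentences = [
    ["india", "won", "the", "match"],
    ["i", "like", "the", "match"],
    ["india", "is", "now", "inching", "towards", "a", "self", "reliant", "state"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimension D of the dense representation
    window=2,         # context words on each side of the center word
    sg=0,             # 0 = CBOW: predict the center word from the averaged context
    min_count=1,
    epochs=50,
)

print(model.wv["india"][:5])           # dense vector for a word
print(model.wv.most_similar("match"))  # nearest neighbours in the embedding space
```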
16. Skip-gram model: High level
• Goal: predict the context words $w_{t-2}, \dots, w_{t+2}$ given the word $w_t$.
• Diagram: the one-hot vector of $w_t$ is multiplied by a projection matrix of shape V x D (to be learned), giving the center-word vector C; C is multiplied by an output projection matrix M of dimension D x V and passed through a softmax layer; the output is compared against the one-hot vector of each context word $w_{t+j}$ with a cross-entropy loss.
17. Skip-gram continued
• An end-to-end flow of training (figure adapted from Manning's lectures on YouTube): the one-hot vector of $w_t$ (V x 1) selects one column of the embedding matrix (D x V), giving the center-word vector $v_c$ (D x 1); the context matrix (V x D, with rows $u_w^\top$, shared across all context-word predictions) multiplies $v_c$ to give one score $u_w^\top v_c$ per vocabulary word; a softmax over these scores, $\mathrm{softmax}(u_w^\top v_c)$, is compared against the ground-truth one-hot vector of the context word being predicted (e.g. $w_{t-1}$).
18. Skip-gram continued
• It focuses on optimizing the loss for each word:
  $p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$
• It calculates the probability of an output context word $o$ given the center word $c$.
• The loss function it tries to minimize is:
  $J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$
• The log probability is calculated as in the first equation.
• Naive training is costly because the gradient calculation is of order $V$.
• Two computationally efficient methods were proposed:
  • Hierarchical softmax
  • Negative sampling
19. Skip-gram continued
• Training with negative sampling is the more prevalent choice.
• In the earlier example, tuples like (India, the) and (India, now) are examples of true cases.
• Any corrupted tuple, like (India, reliant) or (India, state), is called a negative sample.
• With a modified objective function, this turns training into a logistic regression that classifies a tuple as a true combination or a corrupt one.
• Corrupt tuples are generated by sampling such that less frequent words are picked more often as negatives.
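A hedged sketch of skip-gram training with negative sampling in gensim, mirroring the setup described on the next slide (NLTK Reuters corpus, window 15, 100 dimensions); sg=1 selects skip-gram and negative=5 draws five corrupt tuples per true pair. The exact neighbours and scores will vary from the slide's.

```python
import nltk
from nltk.corpus import reuters
from gensim.models import Word2Vec

nltk.download("reuters")
sentences = [[w.lower() for w in sent] for sent in reuters.sents()]

model = Word2Vec(
    sentences,
    vector_size=100,
    window=15,
    sg=1,          # skip-gram: predict context words from the center word
    negative=5,    # negative samples per true (center, context) pair
    min_count=5,
    epochs=50,
)

print(model.wv.most_similar("crude", topn=5))
print(model.wv.most_similar("ship", topn=5))
```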
20. Word Embedding visualization
t-SNE 2D projection of Word2vec (gensim implementation) embeddings of the top 10 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.
Top 5 similar words
crude barrel 0.548
crude oteiba 0.464
crude netback 0.45
crude refinery 0.438
crude pipeline 0.421
-------
ship vessel 0.623
ship port 0.575
ship tanker 0.496
ship navigation 0.471
ship crane 0.463
-------
computer software 0.602
computer micro 0.559
computer printer 0.542
computer mainframe 0.538
computer hemdale 0.527
--------
⢠Even with a smaller corpus it can
capture semanticallly relevant
words.
22. Analogies
• Representation of an analogy in vector space using word2vec vectors:
  • The vector representation of "King - man + woman" is roughly equivalent to the vector representation of "queen".
  • Using gensim and pretrained word2vec, the 5 closest words to the analogy vector "King - man + woman" are:
    queen 0.7118
    monarch 0.619
    princess 0.5902
    crown_prince 0.55
    prince 0.54
Image taken from https://jalammar.github.io/illustrated-word2vec/
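A short sketch of that analogy query with gensim's downloader API; "word2vec-google-news-300" is gensim's identifier for the pretrained Google News vectors, and the scores will only approximately match the numbers above.

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # large download (~1.6 GB)

# king - man + woman  ~  queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```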
23. GloVe: Global Vectors for Word Representation
• It is an unsupervised learning method for learning word representations.
• It is based on a co-occurrence matrix.
• The co-occurrence matrix is built on the whole corpus.
• It is able to capture global context.
• It encompasses the best of two model families:
  • Local context window methods
  • Global matrix factorization
• Earlier, matrix factorization like LSA was used to reduce the dimensionality.
• Two things that the GloVe model captures:
  • A statistical measure using the co-occurrence matrix
  • Context, by considering the neighbouring words
24. Glove continued
• It moves away from plain matrix factorization: by considering relationship reasoning (semantic and syntactic), GloVe tries to learn the representation for words.
• It can be represented as a factorization:
  word co-occurrence matrix (words x context) = word feature matrix (words x features, the embedding matrix) * feature-context matrix (features x context)
25. GloVe continued
• How does GloVe learn embeddings?
• It considers word-word co-occurrence probabilities as the potential of a relation between words.
• The authors presented the relation with "steam" and "ice" as target words.
• It is common for steam to occur with gas and for ice to occur with solid.
• Other co-occurring words are "water" and "fashion": "water" has some shared property with both, while "fashion" is irrelevant to both.
• Only the ratio of probabilities cancels out noisy words like "water" and "fashion".
• As presented in the paper's table, the ratio of probabilities $P(k \mid ice) / P(k \mid steam)$ is large for k = solid and small for k = gas.
26. GloVe continued
• What is the optimization function for GloVe?
• In the co-occurrence matrix $X$, $X_{ij}$ represents the co-occurrence count of words $i$ and $j$.
• $X_i = \sum_k X_{ik}$ is the total number of times word $i$ appears in any context.
• $P_{ij} = P(j \mid i) = X_{ij} / X_i$ is the probability that word $j$ appears in the context of word $i$.
• For a combination of three words $i, j, k$, a general representation of the model is
  $F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$
• The optimization function proposed by the authors is
  $J = \sum_{i,j=1}^{V} f(X_{ij}) \, (w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$
27. Glove Continued
• Here $V$ is the size of the vocabulary, $w_i$ and $b_i$ are the vector and bias of word $i$, and $\tilde{w}_j$ and $\tilde{b}_j$ are the context vector and its bias. The last term, $\log X_{ij}$, comes from the co-occurrence of $i$ in the context of $j$.
• The weighting function $f(x)$ should have the following properties:
  • It tends to zero as $x \to 0$.
  • It should be non-decreasing, so that rare co-occurrence instances are discriminated.
  • It should not overweight frequent co-occurrences.
• The choice of $f(x)$ is
  $f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$
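This weighting function translates directly into code; the GloVe paper reports $x_{max} = 100$ and $\alpha = 0.75$ as defaults.

```python
def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """GloVe weighting f(x): discounts rare pairs, caps the weight of frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

for x in (1, 10, 100, 1000):
    print(x, round(glove_weight(x), 3))
```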
28. Glove Continued
• The model has the following computational bottlenecks:
  • Creating a big co-occurrence matrix of size V x V.
  • The model's computational complexity depends on the number of non-zero elements.
• During training the context window needs to be sufficiently large so that the model can distinguish left context from right context.
• Words that are more distant from each other contribute less to the count, because distant words contribute less to the relationship between words.
• The model generates two sets of vectors, $W$ and $\tilde{W}$; their average is used as the word representation.
29. Glove results
t-SNE 2D projection of GloVe embeddings of the top 5 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.
Top 5 most relevant word list:
----------------------------------------
crude barrel 0.752
crude posting 0.58
crude raise 0.537
crude light 0.505
crude sour 0.502
----
ship loading 0.58
ship kuwaiti 0.54
ship missile 0.537
ship vessel 0.522
ship flag 0.522
-----
computer wallace 0.595
computer software 0.592
computer microfilm 0.559
computer microchip 0.536
computer technology 0.52
30. Are Word2vec and GloVe enough?
• Neither embedding can deal with out-of-vocabulary words.
• Both can capture the context, but in a limited sense.
• They always produce a single embedding for the word in consideration.
• They can't distinguish:
  • "I went to a bank." and "I was standing at a river bank."
  • They will always produce a single representation for both contexts.
• Both give better performance than encodings like tf-idf, count vectors, etc.
• Does a pretrained model help?
  • Models pretrained on huge corpora show better performance than those trained on small corpora.
  • Pretrained Word2vec models are available from Google2 and GloVe models from Stanford's website1.
1. https://nlp.stanford.edu/projects/glove/
2. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
31. Fasttext
• It was proposed by the Facebook AI team.
• It was primarily meant to handle the out-of-vocabulary issue of GloVe and Word2vec.
• It is an extension of Word2vec.
• This model relies on character n-grams rather than words to generate the embeddings.
• This model relies on the morphological features of a word.
• The character n-grams of a word can be represented as below:
  • For the word <where> and n=3, the character n-grams are:
    • <wh, whe, her, ere, re>
• The final representation for the word "where" is the sum of the vector representations of <wh, whe, her, ere, re>.
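A small sketch of that character n-gram extraction; only n=3 is shown to match the slide, whereas the released fastText actually uses n-grams of several lengths plus the full word.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Character n-grams of a word, with '<' and '>' marking the word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']
```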
32. Fasttext continued
• The modified scoring function is
  $s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^\top v_c$
• Here $\mathcal{G}_w$ is the set of n-grams for word $w$ and $z_g$ is the vector representation of n-gram $g$.
• Learning n-gram vectors enables the model to produce representations for out-of-vocabulary words as well.
33. Fasttext results
t-SNE 2D projection of fasttext embeddings (gensim implementation) of the top 15 similar words, trained for 50 epochs on the Reuters news corpus from NLTK, with context length 15 and vector dimension 100.
Top 5 most relevant word list:
-----------------------------------------
crude cruz 0.582
crude barrel 0.561
crude cruise 0.501
crude crumble 0.433
crude jude 0.41
-----
ship shipyard 0.714
ship steamship 0.703
ship shipowner 0.688
ship shipper 0.668
ship vessel 0.667
-----
computer supercomputer 0.843
computer computerized 0.823
computer computerland 0.773
computer software 0.54
computer microfilm 0.52
34. Observation
• None of these representations can capture a contextual representation, meaning a representation based on the usage of the word.
• They are based on a dictionary look-up to get the embeddings.
• They give limited performance on tasks such as question answering and summarization compared to current state-of-the-art models like ELMo (LSTM based), BERT (transformer based), etc.
35. ELMo
• Language is complex; the meaning of a word can vary from context to context.
• E.g.
  • I went to a bank to deposit some money.
  • I was standing at a river bank.
• The two instances of "bank" have separate meanings.
• Earlier models assign the same meaning to the word in each scenario.
• A solution would be to have multiple levels of understanding of text.
• A new model that can capture context:
  • ELMo: Embeddings from Language Models
36. ELMo continued
• It is based on a deep learning framework.
• It generates contextualized embeddings for a word.
• It models complex characteristics of words (e.g. syntactic and semantic features).
• It models linguistic variation across contexts, like polysemy.
• The authors argue that it captures abstract linguistic characteristics in the higher layers.
• It is based on a bi-directional LSTM model.
  • The bi-directional model helps capture the dependency on past words as well as future words.
37. ELMo Architecture
• Block diagram of ELMo: the word embedding $x_k$ feeds an L-layer stack of forward and backward LSTMs, topped by a softmax layer.
• The number of layers in the original implementation was two.
• The word embedding is calculated by a char-CNN.
• Final embeddings are generated as a weighted sum of the hidden layers and the embedding layer.
• Three different representations can be obtained:
  • Hidden layer 1
  • Hidden layer 2
  • Weighted sum of the hidden layers and the embedding layer
• We go through each block in the coming slides.
38. ELMo input Embedding
• The input embedding is generated by a combination of a character CNN and a highway network:
  character embedding -> CNN with max pooling -> 2-layer highway network -> LSTM (e.g. for the word "India")
https://www.mihaileric.com/posts/deep-contextualized-word-representations-elmo/
39. Character CNN embedding and Highway network
Embedding Layer
• In the first step a look-up is performed to get character embeddings.
• A 1-D convolution is applied on the embeddings, followed by a max-pooling layer.
• The highway network acts as a gate that determines how much of the original information passes directly to the output and how much goes via the projection.
• The final generated output is passed as input to the 2-layer LSTM structure.
• In the original paper there were two highway layers.
• Character-level embedding enables the model to learn a representation for any word, so it can handle out-of-vocabulary words as well.
Source: http://web.stanford.edu/class/cs224n/
40. LSTM layer and Embedding
• The architecture has bi-directional LSTMs to predict the next word from both sides; together they form a biLM. The full embedding method is described in the figure.
• Two separate LSTM stacks implement a language model in each direction.
• The forward LM's top layer predicts the next token using the softmax layer.
• Similarly, the backward LM predicts the previous token using its softmax scores.
• Each of the L forward LSTM layers generates a contextualized representation $\overrightarrow{h}_{k,j}^{LM}$ for token $t_k$, where $j = 1, 2, \dots, L$.
• Figure: the forward and backward hidden states at each layer are concatenated per token.
41. LSTM layer and Embedding continued
• Similarly, each of the L backward LSTM layers generates a contextualized representation $\overleftarrow{h}_{k,j}^{LM}$ for token $t_k$, where $j = 1, 2, \dots, L$.
• In total $2L + 1$ representations are generated: $2L$ by the hidden layers and one by the embedding layer.
• The final representation is a weighted combination of the concatenated hidden vectors and the embedding layer.
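A toy numpy sketch of that weighted combination; the shapes and weights below are illustrative assumptions, and in ELMo the layer weights s_j and the scale gamma are learned separately for each downstream task.

```python
import numpy as np

L, D = 2, 8                      # two biLM layers, toy hidden size
rng = np.random.default_rng(0)
# h_0 is the (char-CNN) token embedding, h_1 and h_2 are concatenated LSTM states
layers = [rng.normal(size=D) for _ in range(L + 1)]

s = np.exp(rng.normal(size=L + 1))
s /= s.sum()                     # softmax-normalized layer weights (task-specific)
gamma = 1.0                      # task-specific scale

elmo_vector = gamma * sum(w * h for w, h in zip(s, layers))
print(elmo_vector.round(3))      # the final ELMo representation for one token
```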
42. ELMo continued
• Representation of the word "bank" in different contexts: vectors are projected into 2D space based on the vector representation of the word "bank".
• It can generate a different embedding for a word depending on the context.
• The projection is based on the average of the hidden layers' and embedding layer's representations from a pretrained model from TensorFlow Hub.
• It can be tuned to perform different tasks like coreference resolution, sentiment analysis and question answering.
43. ELMo continued
• From the results presented by the authors, higher layers tend to capture semantic features and lower layers capture syntactic features.
• The second layer's embedding outperforms the first layer's embedding in the word sense disambiguation task, which is semantic in nature. On the other hand, in the POS tagging task the first layer's embedding outperforms the second layer's.
44. Bidirectional Encoder Representations from Transformers (BERT)
• A transformer-based model to learn contextual representations.
• It is designed to pre-train deep bidirectional representations from unlabeled text by considering both left and right context.
• It is a pre-trained model.
• It uses the concept of transfer learning, similar to what ELMo does.
• Any use of BERT is based on a two-step process:
  • Train a large language model on a huge amount of data, using an unsupervised or semi-supervised method.
  • Fine-tune the large model for a specific NLP task.
• Before we delve further into BERT, the following concepts need to be understood:
  • Attention mechanism
  • Transformer architecture proposed by Vaswani et al.
45. Attention mechanism
• This concept was brought into NLP from computer vision.
• The first use of an attention mechanism in NLP was by Bahdanau et al. in 2015; it was based on an additive mechanism.
• It is similar to the way we process an image in our brain: we focus on some parts and infer other parts from that information. In an image, not all regions carry equally useful information.
• In sentence processing as well, we attend to relations between some of the words while others get low attention.
• Illustration: in the sentence "I was going to a crowded market", some words receive a high level of attention and others a low level of attention.
46. Attention mechanism
• In the above example, we tend to attend to "market" in the context of "crowded" and to "going" in the context of "market"; other words receive less attention.
• The attention mechanism in NLP was first employed in the neural machine translation task.
• Seq-to-seq (encoder-decoder) models have an issue with longer sentences: they fail to remember relations between distant words.
• The attention mechanism was designed to capture long-distance relationships between two words.
• The attention mechanism can be understood as a vector of importance weights.
• The next slide lays out the basics of the attention mechanism.
47. Attention mechanism continued
• The intuition is like a query fired by us to extract some information from a database: the query is matched against each key and generates a similarity score.
• The query and each key $K_i$ are vectors of some dimension d; each score is a scalar.
• The scoring function can be of different types:
  • Simple dot product: $Q^\top K_i$
  • Scaled dot product: $\frac{Q^\top K_i}{\sqrt{d}}$
  • General dot product: $Q^\top W K_i$
  • Additive: $W_q^\top Q + W_k^\top K_i$
• Diagram: the query is scored against keys $K_1, \dots, K_4$ to give scores $s_1, \dots, s_4$; the scores are normalized with a softmax operation; each normalized score multiplies the corresponding value vector $v_1, \dots, v_4$; the attention value is the weighted sum $\sum_i s_i v_i$.
48. Attention mechanism continued
• The final attention value is a vector, while each score $s_i$ is a scalar.
• The general framework of the attention mechanism generates a weighted combination of the value vectors.
• Self-attention is also known as intra-attention.
• It is an attention mechanism relating different words of a single sequence to generate a representation of that same sequence.
• With this basic understanding of the attention mechanism, we can move to the transformer architecture proposed by Vaswani et al.
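A compact numpy sketch of the scaled dot-product variant with toy shapes; this is the scoring function the Transformer uses.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # one score per (query, key) pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 positions, dimension d = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```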
49. Transformer model
Image source: https://arxiv.org/pdf/1706.03762.pdf
• The Transformer architecture was proposed by the Google AI team.
• It is an encoder-decoder architecture.
• The core modules of this architecture are:
  • Multi-head attention
  • Positional encoding
  • Attention mechanism
  • Masked multi-head attention
  • Residual connections
• This model solves the issues with recurrent networks:
  • Failure to capture long-distance word-to-word relations.
  • RNNs are sequential in nature; this architecture can be parallelized.
• Each attention head learns a different set of features.
• This model does away with recurrence.
50. Input embedding and positional encoding
• Embeddings are collected from some pre-trained model using a dictionary look-up.
• Unlike an RNN, which takes input sequentially, it takes the whole sentence as input.
• Without positional information, it would be similar to a bag-of-words model.
• How is positional information calculated?
• Positional embedding is calculated using an alternating combination of sine and cosine:
  $PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$ and $PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$
• Why use periodic functions with varying frequency?
  • This encoding generates an entirely different encoding for each position, and the distance between two time steps stays consistent.
  • The position should be deterministic.
  • The authors argue that this encoding ensures that any $PE_{pos+k}$ can be represented as a linear combination of $PE_{pos}$.
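That formula can be computed directly; a sketch with assumed toy sizes follows, and the resulting matrix is what gets added to the input embeddings.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)    # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                       # PE(pos, 2i+1)
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)   # (50, 512)
```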
51. Input embedding and positional encoding
Diagram: for each position, the word embedding vector is added element-wise to its positional encoding vector, and the combined embedding is fed into the attention layer. In this way, positional information is combined with the word embeddings.
53. Multi-Head attention
• Diagram: Query, Key and Value each pass through a linear layer; MatMul of Q and K, then Scale, then Softmax, then MatMul with V; the heads' outputs are concatenated and passed through a final linear layer.
• A linear layer maps input to output and can change the vector dimension; its weights are learned in the training process.
• In the case of the original paper, the 512-dimensional vectors were projected into a 64-dimensional vector space.
• What do we feed into query, key and value?
  • At the start of training, the same copies of the word vectors are passed as Query (Q), Key (K) and Value (V).
• The MatMul operation generates a matrix that is a precursor to the attention filter. The values are scaled by $\sqrt{d}$.
• Finally, the softmax layer generates the probabilities across all the words.
• The result is a matrix that is called the "attention filter".
54. Multi-Head attention
• The probability matrix is multiplied with $V$, which generates a representation weighted by the attention scores.
• This whole process is a self-attention mechanism that generates an encoding with a scaled dot-product scoring function.
• The above explanation corresponds to a single head; there can be many such heads. The original model has 8 heads.
• Each head's attention filter learns different features.
• The representations generated by the heads are concatenated and passed to the next stacked encoder.
• The final representation from the encoder stack is fed into the decoder stack.
55. Add and Norm layer
• Residual connections are placed in the model for:
  • Knowledge preservation
  • Handling the vanishing gradient problem.
• In this model, the normalization part is somewhat tricky:
  • The normalization mean and standard deviation are calculated for each word's representation.
  • Using the calculated mean and standard deviation, the layer's values are normalized.
56. Masked multi-head attention
• It is an important feature of the decoder stack.
• Why do we need masking?
  • When generating output, a position should not pay attention to future words, because those words have not been predicted yet; so they are masked.
  • Masking is done by setting the scores of future words to $-\infty$. This ensures that the softmax, which is an exponent, becomes zero for those words.
• Does this model have recurrence?
  • At a cursory glance, it appears so: in the decoder, we feed the previous tokens as input to the model.
  • But it is trained using a concept called teacher forcing: when the output is known, we can directly supply the output representation to the model.
  • By doing so the model can be parallelized as well.
• The base model has 8 heads, with 6 layers in the encoder and decoder stacks, and keys and values of dimension 64.
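A small numpy sketch of that decoder-side masking with toy scores; future positions are set to -inf before the softmax and therefore get zero attention weight.

```python
import numpy as np

T = 4
scores = np.random.default_rng(0).normal(size=(T, T))   # raw attention scores
future = np.triu(np.ones((T, T), dtype=bool), k=1)      # True above the diagonal
scores = np.where(future, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # each row sums to 1, with zeros on future positions
```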
57. Did the Transformer change the landscape in NLP?
• One of the best-known models, GPT, was based on the Transformer architecture.
• Current models like BERT and its variants, and GPT-2, are based on the Transformer architecture.
• It started to take over the space of RNNs because it can be parallelized and can capture context.
• It beat the benchmarks in NMT tasks.
58. Getting back to BERT
• Proposed by Devlin et al. in 2019.
• It is an encoder-based model.
• It is based on the transformer architecture that we discussed in the previous slides.
• It has only the encoder stack from the transformer architecture.
• It is an unsupervised or semi-supervised pre-trained model, fine-tuned for a specific task like Q&A, conversational AI, etc.
• It is a sub-word model with a vocabulary of about 30,000 tokens. The BERT tokenizer tokenizes the words, so the representation of a tokenized word may not correspond directly to what we passed as input.
  • e.g. the word "embeddings" is tokenized into ["em", "##bed", "##ding", "##s"]
• This approach helps to address out-of-vocabulary words as well.
• The two most common pre-trained architectures are BERT_base and BERT_large.
• BERT_base has 12 stacked encoders and BERT_large has 24 stacked encoders.
59. BERT Continued
• The base version has 12 attention heads and the "large" version has 16 attention heads.
• For comparison, the original Transformer configuration described in the paper had 6 encoder layers and 8 attention heads, leading to a 512-dimensional representation.
• How do we collect the representation from BERT?
  • BERT uses two special tokens, [CLS] and [SEP].
  • [CLS] is always the first token of the input.
  • [SEP] marks the sentence segmentation.
  • We also need to provide the segment ids in the input.
• As in the original Transformer model, the encoded embeddings are passed on to the subsequent encoders.
• Each position outputs a vector representation of size 768 per token for BERT-base and 1024 for the large model (see the sketch after this slide).
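A short sketch of how the special tokens, the segment ids and the 768-dimensional outputs look in practice, again assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the two sentences are the ones used in the visualization slides below):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer("Who does not like chocolate",
                "Even a grown up would want to have a nice bite",
                return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])          # segment ids: 0 for sentence A, 1 for sentence B

with torch.no_grad():
    out = model(**enc)
print(out.last_hidden_state.shape)    # (1, seq_len, 768) for BERT-base; 1024 for BERT-large
```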
60. BERT Continued
• For a classification task, we only focus on the embedding of the [CLS] token.
• What would the representation be for other tasks?
  • There are several variants of embedding collection.
  • BERT-base, with its 12 encoder layers, generates 12 + 1 sets of embeddings; the extra one comes from the input embedding layer.
  • Which layers should we use: all or some?
  • Several experiments have been performed; the best performance was obtained by concatenating the last 4 layers' representations (see the sketch after this slide).
  • The next best representation is the average of the last 4 layers' representations.
• Each layer learns different features, so the pooling strategy depends on the specific NLP task. The two suggestions above are based on performance on a NER tagging task.
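A minimal sketch of the two pooling strategies mentioned above (concatenating or averaging the last four layers), assuming the Hugging Face transformers library and bert-base-uncased:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

enc = tokenizer("Who does not like chocolate", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**enc).hidden_states   # tuple of 13 tensors: input embeddings + 12 layers

concat_last4 = torch.cat(hidden_states[-4:], dim=-1)[0]      # (seq_len, 4 * 768): concatenation of last 4 layers
mean_last4 = torch.stack(hidden_states[-4:]).mean(dim=0)[0]  # (seq_len, 768): average of last 4 layers
```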
61. BERT Architecture
• As proposed in the original paper, the BERT input embedding is the sum of the token embeddings, segment embeddings and position embeddings.
Image source: https://arxiv.org/pdf/1810.04805.pdf
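A minimal sketch of summing the three embeddings; the layer sizes match BERT-base, but the random token ids and module names are illustrative, not the actual BERT implementation:

```python
import torch
import torch.nn as nn

vocab_size, max_len, n_segments, d_model = 30522, 512, 2, 768
tok_emb = nn.Embedding(vocab_size, d_model)
seg_emb = nn.Embedding(n_segments, d_model)
pos_emb = nn.Embedding(max_len, d_model)     # BERT uses learned position embeddings

input_ids = torch.randint(0, vocab_size, (1, 6))           # placeholder token ids
segment_ids = torch.zeros_like(input_ids)                  # all tokens from sentence A
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # 0, 1, ..., 5

x = tok_emb(input_ids) + seg_emb(segment_ids) + pos_emb(positions)  # (1, 6, 768) input to the encoder stack
```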
62. BERT pre-training
• BERT is pre-trained using two methods:
• Masked LM:
  • Unlike the masking we discussed for the Transformer decoder, here some random words are replaced with a special token [MASK] - approximately 15% of all the sub-word tokens.
  • The prediction can then be defined as a language model conditioned on both the left and the right context:
    P(t_i | t_1, t_2, ..., t_{i-1}, t_{i+1}, ..., t_n)
  • The masking strategy is further divided into three parts (see the sketch after this slide):
    • in 80% of the instances the token is replaced by [MASK]
    • 10% of the time it is replaced by a random word
    • 10% of the time it is left unchanged
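The 80/10/10 masking strategy might be sketched like this (plain-Python pseudocode over word tokens; a hypothetical helper, not the original preprocessing code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Pick ~15% of the tokens, then apply the 80/10/10 replacement rule."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with the [MASK] token
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random word
            # remaining 10%: leave the token unchanged
    return masked, labels
```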
63. BERT pre-training
• Next sentence prediction:
  • This is primarily a classification task: does the next sentence follow the previous sentence or not?
  • The training set has 50% instances where sentence B actually follows sentence A and 50% negative cases where sentence B is replaced with a random sentence.
  • This type of training is helpful for Q&A tasks, where question and answer are represented as a pair.
• The pre-trained BERT embeddings became the new state-of-the-art word embedding representation for most NLP tasks.
• The original model outperformed the previous SOTA benchmarks.
64. BERT attention layer visualization
• Two sentences are taken as input:
  • "Who does not like chocolate"
  • "Even a grown up would want to have a nice bite"
• Using the BERTviz tool1, we visualize the attention from the second sentence to the first sentence.
(Figure: attention by head 11 of layer 11 and by head 1 of layer 11.)
65. BERT attention visualization
• It appears that different heads, even within the same layer, can capture different relations between the sentences: in head 11 all words attended to "chocolate", while in head 0 the attention was spread over most of the words.
• Looking at the attention within the same sentence, it appears that almost all the words attended to "bite"; the attention from "want", "have" and "nice" towards "bite" is higher.
• We observe that different layers capture different features. The type of feature captured by each head may not be describable in exact terms such as syntactic, semantic, etc.
66. BERT pre-trained models
• Several pre-trained models are available from the HuggingFace1 team.
• Is BERT-large, with >300M parameters, big?
  • Can we squeeze the performance into a smaller model?
  • Should we train more, with more layers and attention heads?
• Both questions have the same answer - yes.
• Two models, DistilBERT and GPT-2-XL, are the answers to the questions above.
  • DistilBERT is a smaller model with similar performance.
  • GPT-2-XL has 48 layers!!!
• A better training strategy can give better results: RoBERTa-large is a robustly trained version of BERT-large, and similarly there is a base version of RoBERTa.
1. https://huggingface.co/transformers/pretrained_models.html
67. DistilBERT
• Knowledge distillation is a technique in which a smaller model is trained to mimic the behaviour of a larger model.
• It is sometimes called teacher-student learning, where the student is the smaller model and the teacher is the bigger model.
• It was generalized by Hinton et al.
• The student is trained to learn the full output distribution of the teacher.
• The training has one small change: the student is not trained against the gold labels but against the probabilities produced by the teacher:
  L = - Σ_i t_i · log(s_i)
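This loss, where the teacher's probabilities t_i weight the student's log-probabilities log(s_i), might be sketched in PyTorch as follows (the temperature argument is a common distillation convention, not something stated on the slide):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    t = F.softmax(teacher_logits / temperature, dim=-1)          # soft targets t_i from the teacher
    log_s = F.log_softmax(student_logits / temperature, dim=-1)  # student log-probabilities log(s_i)
    return -(t * log_s).sum(dim=-1).mean()                       # L = -sum_i t_i * log(s_i)
```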
68. DistilBERT
• How was the model trained?
  • The model was trained with a distillation loss based on the Kullback-Leibler divergence, which measures the divergence between two probability distributions:
    KL(t || s) = Σ_i t_i · log(t_i) - Σ_i t_i · log(s_i)
  • The overall loss is a linear combination of the masked-LM loss and the distillation loss (see the sketch after this slide).
• Model parameter changes:
  • The next-sentence classification objective was dropped compared to the original version.
  • The number of layers was reduced by a factor of two.
• Did this affect the performance?
  • Yes, it did. Still, DistilBERT retains about 95% of the performance of the original BERT.
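A rough sketch of the linear combination of the masked-LM loss and the KL-based distillation loss mentioned above (the weight alpha and the exact reductions are illustrative assumptions; the actual DistilBERT training recipe may differ in its details):

```python
import torch.nn.functional as F

def distilbert_loss(student_logits, teacher_logits, labels, alpha=0.5):
    # KL(teacher || student): the first argument of kl_div must be log-probabilities
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    mlm = F.cross_entropy(student_logits, labels)   # standard masked-LM loss against the gold tokens
    return alpha * kl + (1 - alpha) * mlm
```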
69. DistilBERT
• Another trick to capture the performance of the teacher was weight initialization: the student's layers were initialized with the weights already learned by the teacher.
• It was trained on larger batches with the masked-language-model objective, similar to the original BERT recipe.
• Visualization of attention in DistilBERT:
  • In this case as well, the word "chocolate" was attended to by the relevant words such as "bite".
70. RoBERTa
• It was proposed by the Facebook AI team.
• It is a training strategy for learning a better representation compared to BERT.
• Two important points were compared with the original pre-training:
  • Static vs. dynamic masking: in the original paper the [MASK] positions were fixed statically before training. In this work the data was duplicated 10 times so that different masking patterns could be seen for the same context. This did not improve the result.
  • Training with a larger batch size than in the original pre-training led to better accuracy.
• This model also dropped the next-sentence-prediction objective used in the original BERT.
71. RoBERTa
• This model was trained on a huge amount of data, roughly 160 GB of text.
• It was trained with a batch size of 8K, compared to the batch size of 256 in BERT.
• Finally, it was trained for a longer duration.
• It was able to beat the SOTA on different tasks; on GLUE it surpassed XLNet. It appears that proper training alone can give better results.
72. RoBERTa Visualization
• Architecturally it is similar to BERT. In attention head 4 of layer 5, the word "chocolate" is attended to by "like" and "bite", which matches a human reading of the sentences.
• Layer 4 captured attention similar to layer 10 of BERT, which we have already seen.
• This supports the view that different features are captured at different layers.
73. sBERT
• It is a fine-tuned BERT model that generates better representations for sentences, so that performance on common similarity measures can be improved.
• On seven semantic textual similarity tasks, even averaged GloVe representations performed better than the average of the BERT encodings.
• This model fine-tunes BERT for sentence similarity.
• It is trained using Siamese and triplet networks.
  • In a Siamese network, two networks with the same architecture are placed side by side with tied weights.
• The tuned models are task specific.
• Classification task:
  • In this task the representations u and v learned by the Siamese network on top of BERT are concatenated with their element-wise difference |u - v| (see the sketch after the next slide).
74. sBERT
• The concatenated representation is multiplied by a matrix W_t, whose weights are learned to increase the classification accuracy.
Image source: https://arxiv.org/pdf/1908.10084.pdf
• The other two scenarios are:
  • A regression task, where the last two layers are replaced by the cosine similarity between the two vectors; the objective function is the mean squared error.
  • A triplet objective, applied when a sentence a has a positive relation with a sentence p and a negative relation with a sentence q; the loss function then tries to place a closer to p and farther from q.
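A minimal sketch of the classification objective: the sentence embeddings u and v are concatenated with |u - v| and multiplied by the learned matrix W_t (implemented here as a linear layer; the 3-class output is an illustrative choice, e.g. for an NLI-style task):

```python
import torch
import torch.nn as nn

d = 768                          # sentence embedding size from BERT-base
W_t = nn.Linear(3 * d, 3)        # learned matrix mapping (u, v, |u - v|) to class scores

def sbert_classifier(u, v):
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
    return torch.softmax(W_t(features), dim=-1)

u, v = torch.randn(1, d), torch.randn(1, d)   # pooled sentence embeddings from the Siamese BERT
probs = sbert_classifier(u, v)                # (1, 3) class probabilities
```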
75. sBERT training
• The model is trained with different hyperparameters and strategies:
  • Pooling strategy:
    • Three strategies were tried for pooling the BERT embeddings into a sentence representation - MAX, MEAN and [CLS].
    • MEAN showed the best performance (see the sketch after this slide).
  • For the classification task the vectors u and v were concatenated in different ways, but the best performance was achieved when |u - v| was concatenated with u and v.
• Observations:
  • Fine-tuning is required and is task specific.
  • Transfer learning from BERT can be used; this amounts to fine-tuning BERT for a specific task.
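A minimal sketch of the MEAN pooling strategy that worked best, assuming token embeddings from any BERT-like encoder and its attention mask (the function name is illustrative):

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    # Average the token embeddings, ignoring padded positions
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts                           # (batch, hidden_size) sentence embeddings
```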
76. Pre-trained models
• Most of the major models are very large and take a lot of computational resources to train. These models are open-sourced by researchers or tech companies.
• This enables other researchers to use transfer learning and fine-tune them for their task.
• The recent model in the GPT series (GPT-3) has not been open-sourced.
• Another version of ELMo, based on the Transformer architecture, has also been released.
• As the models get heavier and heavier, a model by NVIDIA (Megatron-LM) has 8,300M parameters!!
• So, word embeddings are still evolving. But BERT and ELMo were the "VGG16" moment for NLP!!!