To appear as a part of Prof. Ali Ghodsi's material on deep learning.
Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey
Benyamin Ghojogh BGHOJOGH@UWATERLOO.CA
Department of Electrical and Computer Engineering,
Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada
Ali Ghodsi ALI.GHODSI@UWATERLOO.CA
Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science,
Data Analytics Laboratory, University of Waterloo, Waterloo, ON, Canada
Abstract
This is a tutorial and survey paper on the attention mechanism, transformers, BERT, and GPT. We first explain the attention mechanism, the sequence-to-sequence model without and with attention, self-attention, and attention in different areas such as natural language processing and computer vision. Then, we explain transformers, which do not use any recurrence. We explain all the parts of the encoder and decoder in the transformer, including positional encoding, multihead self-attention and cross-attention, and masked multihead attention. Thereafter, we introduce the Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) as the stacks of the encoders and decoders of the transformer, respectively. We explain their characteristics and how they work.
1. Introduction
When looking at a scene or picture, our visual system, as well as a machine learning model (Li et al., 2019b), focuses on, or attends to, the specific parts of the scene/image that carry more information and importance, and ignores the less informative or less important parts. For example, when we look at the Mona Lisa portrait, our visual system attends to Mona Lisa's face and smile, as Fig. 1 illustrates. Moreover, when reading a text, especially when we want to read fast, one technique is skimming (Xu, 2011), in which our visual system or a model skims the data at a high pace and only attends to the more informative words of the sentences (Yu et al., 2018). Figure 1 shows a sample sentence and highlights the words on which our visual system focuses more in skimming.

Figure 1. Attention in the visual system for (a) seeing a picture by attending to more important parts of the scene and (b) reading a sentence by attending to more informative words in the sentence.
The concept of attention can be modeled in machine learning, where attention is a simple weighting of data. In the attention mechanism, explained in this tutorial paper, the more informative or more important parts of the data are given larger weights for the sake of more attention. Many of the state-of-the-art Natural Language Processing (NLP) (Indurkhya & Damerau, 2010) and deep learning techniques in NLP (Socher et al., 2012) make use of the attention mechanism.
Transformers are also autoencoders, which encode the input data to a hidden space and then decode it to another domain. Transfer learning is widely used in NLP (Wolf et al., 2019b), and transformers can also be used for transfer learning. Recently, transformers were proposed which are composed merely of attention modules, excluding recurrence and any recurrent modules (Vaswani et al., 2017). This was a great breakthrough. Prior to the proposal of transformers with only the attention mechanism, recurrent models such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the Recurrent Neural Network (RNN) (Kombrink et al., 2011) were mostly used for NLP. Also, other NLP models such as word2vec (Mikolov et al., 2013a;b; Goldberg & Levy, 2014; Mikolov et al., 2015) and GloVe (Pennington et al., 2014) were state-of-the-art before the appearance of transformers. Recently, two NLP methods, named Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT), were proposed, which are the stacks of the encoders and decoders of the transformer, respectively.
In this paper, we introduce and review the attention mechanism, transformers, BERT, and GPT. In Section 2, we introduce the sequence-to-sequence model with and without the attention mechanism, as well as self-attention. Section 3 explains the different parts of the encoder and decoder of transformers. BERT and its different variants are introduced in Section 4, while GPT and its variants are introduced in Section 5. Finally, Section 6 concludes the paper.
2. Attention Mechanism
2.1. Autoencoder and the Context Vector
Consider an autoencoder with encoder and decoder parts, where the encoder gets an input and converts it to a context vector, and the decoder gets the context vector and converts it to an output. The output is related to the input through the context vector in the so-called hidden space. Figure 3-a illustrates an autoencoder with the encoder (left part of the autoencoder), the hidden space (in the middle), and the decoder (right part of the autoencoder). For example, the input can be a sentence or a word in English and the output is the same sentence or word but in French. Assume the word "elephant" in English is fed to the encoder and the word "l'éléphant" in French is output. The context vector models the concept of an elephant, which also exists in the human mind when thinking of an elephant. This context is abstract in the mind and can refer to any fat, thin, huge, or small elephant (Perlovsky, 2006). Another example of such a transformation is transforming a cartoon image of an elephant into a picture of a real elephant (see Fig. 2). As the autoencoder transforms data from one domain to a hidden space and then to another domain, it can be used for domain transformation (Wang et al., 2020), domain adaptation (Ben-David et al., 2010), and domain generalization (Dou et al., 2019). Here, every context is modeled as a vector in the hidden space. Let the context vector be denoted by $c \in \mathbb{R}^p$ in the $p$-dimensional hidden space.
2.2. The Sequence-to-Sequence Model
Consider a sequence of ordered tokens, e.g., a sequence
of words which make a sentence. We want to transform
this sequence to another related sequence. For example,
we want to take a sentence in English and translate it to
the same sentence in French. This model which trans-
Figure 2. Using an autoencoder for Transformation of one domain
to another domain. The used images are taken from the PACS
dataset (Li et al., 2017).
forms a sequence to another related sequence is named the
sequence-to-sequence model (Bahdanau et al., 2015).
Suppose the number of words in a document or
considered sentence be n. Let the ordered in-
put tokens or words of the sequence be denoted by
{x1, x2, . . . , xi−1, xi, . . . , xn} and the output sequence
be denoted by {y1, y2, . . . , yi−1, yi, . . . , yn}. As Fig. 3-a
illustrates, there exist latent vectors, denoted by {li}n
i=1,
in the decoder part for every word. In the sequence-to-
sequence model, the probability of generation of the i-th
word conditioning on all the previous words is determined
by a function g(.) whose inputs are the immediate previous
word yi−1, the i-th latent vector li, and the context vector
c:
$P(y_i | y_1, \ldots, y_{i-1}) = g(y_{i-1}, l_i, c)$. (1)
Figure 3-a depicts the sequence-to-sequence model. In the
sequence-to-sequence model, every word xi produces a
hidden vector hi in the encoder part of the autoencoder.
The hidden vector of every word, hi, is fed to the next hid-
den vector, hi+1, by a projection matrix W . In this model,
for the whole sequence, there is only one context vector
c which is equal to the last hidden vector of the encoder,
i.e., c = hn. Note that the encoder and decoder in the
sequence-to-sequence model can be any sequential model
such as RNN (Kombrink et al., 2011) or LSTM (Hochreiter
& Schmidhuber, 1997).
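To make the flow of Fig. 3-a concrete, the following is a minimal NumPy sketch of a recurrent encoder whose last hidden state serves as the single context vector $c = h_n$; the tanh nonlinearity and the random weight matrices are illustrative assumptions, not the setup of any cited model.

import numpy as np

def encode(X, W, U):
    # X: n input word embeddings of dimension d (one per row)
    # W: hidden-to-hidden matrix (p x p), U: input-to-hidden matrix (p x d)
    h = np.zeros(W.shape[0])
    for x in X:                      # process the sequence token by token
        h = np.tanh(W @ h + U @ x)   # h_i depends on h_{i-1} and x_i
    return h                         # c = h_n: the single context vector

rng = np.random.default_rng(0)
d, p, n = 4, 3, 5
X = rng.normal(size=(n, d))
W = rng.normal(size=(p, p)) * 0.1
U = rng.normal(size=(p, d)) * 0.1
c = encode(X, W, U)                  # context vector fed to the decoder
print(c.shape)                       # (3,)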
2.3. The Sequence-to-Sequence Model with Attention
The explained sequence-to-sequence model can be with at-
tention (Chorowski et al., 2014; Luong et al., 2015). In the
sequence-to-sequence model with attention, the probability
of generation of the i-th word is determined as (Chorowski
et al., 2014):
$P(y_i | y_1, \ldots, y_{i-1}) = g(y_{i-1}, l_i, c_i)$. (2)
Figure 3-b shows the sequence-to-sequence model with
attention. In contrast to the sequence-to-sequence model, which has only one context vector for the whole sequence, this model has a context vector for every
word. The context vector of every word is a linear combi-
nation, or weighted sum, of all the hidden vectors; hence,
Figure 3. Sequence-to-sequence model (a) without and (b) with attention.
the i-th context vector is:
$c_i = \sum_{j=1}^{n} a_{ij} h_j$, (3)
where aij ≥ 0 is the weight of hj for the i-th context
vector. This weighted sum for a specific word in the se-
quence determines which words in the sequence have more
effect on that word. In other words, it determines which
words this specific word “attends” to more. This notion of
weighted impact, to see which parts have more impact, is
called “attention”. It is noteworthy that the original idea of
arithmetic linear combination of vectors for the purpose of
word embedding, similar to Eq. (3), was in the Word2Vec
method (Mikolov et al., 2013a;b).
The sequence-to-sequence model with attention considers
a notion of similarity between the latent vector li−1 of the
decoder and the hidden vector hj of the encoder (Bahdanau
et al., 2015):
$\mathbb{R} \ni s_{ij} := \text{similarity}(l_{i-1}, h_j)$. (4)
The intuition for this similarity score is as follows. The
output word yi depends on the previous latent vector li−1
(see Fig. 3) and the hidden vector $h_j$ depends on the input word $x_j$. Hence, this similarity score relates to the impact of the input $x_j$ on the output $y_i$. In this way, the score $s_{ij}$ shows the impact of the j-th input word on generating the i-th output word in the sequence. This similarity notion can be
a neural network learned by backpropagation (Rumelhart
et al., 1986).
In order to make this score a probability, these scores
should sum to one; hence, we make its softmax form as
(Chorowski et al., 2014):
$\mathbb{R} \ni a_{ij} := \frac{e^{s_{ij}}}{\sum_{k=1}^{n} e^{s_{ik}}}$. (5)
In this way, the score vector $[a_{i1}, a_{i2}, \ldots, a_{in}]^\top$ behaves as a discrete probability distribution. Therefore, in Eq. (3), the weights sum to one and the weights with higher values attend more to their corresponding hidden vectors.
2.4. Self-Attention
2.4.1. THE NEED FOR COMPOSITE EMBEDDING
Many of the previous methods for NLP, such as word2vec
(Mikolov et al., 2013a;b; Goldberg & Levy, 2014; Mikolov
et al., 2015) and GloVe (Pennington et al., 2014), used to
learn a representation for every word. However, for un-
derstanding how the words relate to each other, we can
have a composite embedding where the compositions of
words also have some embedding representation (Cheng
et al., 2016). For example, Fig. 4 shows a sentence which
highlights the relation of words. This figure shows, when
reading a word in a sentence, which previous words in the
sentence we remember more. This relation of words shows
that we need to have a composite embedding for natural
language embedding.
2.4.2. QUERY-RETRIEVAL MODELING
Consider a database with keys and values where a query
is searched through the keys to retrieve a value (Garcia-
Molina et al., 1999). Figure 5 shows such a database. We
Figure 4. The relation of words in a sentence raising the need for
composite embedding. The credit of this image is for (Cheng et al., 2016).
Figure 5. Query-retrieval in a database.
can generalize this hard definition of query-retrieval to a soft query-retrieval where several keys, rather than only one key, can correspond to the query. For this, we calculate the similarity of the query with all the keys to see which keys are more similar to the query. This soft query-retrieval is formulated as:
$\text{attention}(q, \{k_i\}_{i=1}^n, \{v_i\}_{i=1}^n) := \sum_{i=1}^{n} a_i v_i$, (6)
where:
$\mathbb{R} \ni s_i := \text{similarity}(q, k_i)$, (7)
$\mathbb{R} \ni a_i := \text{softmax}(s_i) = \frac{e^{s_i}}{\sum_{k=1}^{n} e^{s_k}}$, (8)
and $q$, $\{k_i\}_{i=1}^n$, and $\{v_i\}_{i=1}^n$ denote the query, keys, and values, respectively. Recall that the context vector of a sequence-to-sequence model with attention, introduced by Eq. (3), was also a linear combination with weights of normalized similarity (see Eq. (5)). The same linear combination appears in Eq. (6), where the weights are the similarity of the query with the keys. An illustration of Eqs. (6), (7), and
(8) is shown in Fig. 6. Note that the similarity si can be
any notion of similarity. Some of the well-known similarity
Figure 6. Illustration of Eqs. (6), (7), and (8) in attention mecha-
nism. In this example, it is assumed there exist five words (four keys and one query) in the sequence.
measures are (Vaswani et al., 2017):
inner product: $s_i = q^\top k_i$, (9)
scaled inner product: $s_i = \frac{q^\top k_i}{\sqrt{p}}$, (10)
general inner product: $s_i = q^\top W k_i$, (11)
additive similarity: $s_i = w_q^\top q + w_k^\top k_i$, (12)
where $W \in \mathbb{R}^{p \times p}$, $w_q \in \mathbb{R}^p$, and $w_k \in \mathbb{R}^p$ are some learn-able matrices and vectors. Among these similarity measures, the scaled inner product is used the most.
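As a small illustration, the following Python functions sketch the four similarity measures of Eqs. (9)-(12) for a single query and key; the matrix $W$ and the vectors $w_q$ and $w_k$ are random placeholders standing in for learn-able parameters.

import numpy as np

def inner_product(q, k):
    return q @ k                           # Eq. (9)

def scaled_inner_product(q, k):
    p = q.shape[0]
    return (q @ k) / np.sqrt(p)            # Eq. (10)

def general_inner_product(q, k, W):
    return q @ W @ k                       # Eq. (11)

def additive_similarity(q, k, w_q, w_k):
    return w_q @ q + w_k @ k               # Eq. (12)

rng = np.random.default_rng(0)
p = 8
q, k = rng.normal(size=p), rng.normal(size=p)
W = rng.normal(size=(p, p))
w_q, w_k = rng.normal(size=p), rng.normal(size=p)
print(scaled_inner_product(q, k))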
Eq. (6) calculates the attention of a target word (or
query) with respect to every input word (or keys) which are
the previous and forthcoming words. As Fig. 4 illustrates,
when processing a word which is considered as the query,
the other words in the sequence are the keys. Using Eq.
(6), we see how similar the other words of the sequence
are to that word. In other words, we see how impactful the
other previous and forthcoming words are for generating a
missing word in the sequence.
We provide an example for Eq. (6) here. Consider the sentence “I am a student”. Assume we are processing the word “student” in this sequence. Hence, we have a query corresponding to the word “student”. The keys and values correspond to the previous words, which are “I”, “am”, and “a”. Assume we calculate the normalized similarity of the query and the keys and obtain the weights 0.7, 0.2, and 0.1 for “I”, “am”, and “a”, respectively, where the weights sum to one. Then, the attention value for the word “student” is $0.7 v_{\text{I}} + 0.2 v_{\text{am}} + 0.1 v_{\text{a}}$.
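A tiny numeric sketch of this example, with made-up two-dimensional value vectors, is the following.

import numpy as np

# Hypothetical value vectors for "I", "am", and "a" (any dimension works)
v_I, v_am, v_a = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
weights = np.array([0.7, 0.2, 0.1])           # normalized similarities, sum to one
attention_student = 0.7 * v_I + 0.2 * v_am + 0.1 * v_a
print(attention_student)                      # [0.8, 0.3]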
2.4.3. ATTENTION FORMULATION
Let the words of a sequence of words be in a d-dimensional space, i.e., the sequence is $\{x_i \in \mathbb{R}^d\}_{i=1}^n$. This d-dimensional representation of words can be taken from any
word-embedding NLP method such as word2vec (Mikolov
et al., 2013a;b; Goldberg & Levy, 2014; Mikolov et al.,
2015) and GloVe (Pennington et al., 2014). The query, key,
and value are projections of the words onto p-dimensional, p-dimensional, and r-dimensional subspaces, respectively:
$\mathbb{R}^p \ni q_i = W_Q^\top x_i$, (13)
$\mathbb{R}^p \ni k_i = W_K^\top x_i$, (14)
$\mathbb{R}^r \ni v_i = W_V^\top x_i$, (15)
where $W_Q \in \mathbb{R}^{d \times p}$, $W_K \in \mathbb{R}^{d \times p}$, and $W_V \in \mathbb{R}^{d \times r}$ are the projection matrices onto the low-dimensional query, key, and value subspaces, respectively. Consider the words, queries, keys, and values in matrix form as $X := [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, $Q := [q_1, \ldots, q_n] \in \mathbb{R}^{p \times n}$, $K := [k_1, \ldots, k_n] \in \mathbb{R}^{p \times n}$, and $V := [v_1, \ldots, v_n] \in \mathbb{R}^{r \times n}$, respectively. It is noteworthy that in the similarity measures, such as the scaled inner product, we have the inner product $q^\top k_i$. Hence, the similarity measures contain $q^\top k_i = x^\top W_Q W_K^\top x_i$. Note that if we had $W_Q = W_K$, this would be a kernel matrix, so its behaviour is similar to a kernel for measuring similarity.
Considering Eqs. (7) and (8) and the above definitions, Eq. (6) can be written in matrix form, for the whole sequence of n words, as:
$\mathbb{R}^{r \times n} \ni Z := \text{attention}(Q, K, V) = V\, \text{softmax}\Big(\frac{1}{\sqrt{p}} Q^\top K\Big)$, (16)
where $Z = [z_1, \ldots, z_n]$ contains the attention values for all the words, which show how much every word attends to its previous and forthcoming words. In Eq. (16), the softmax operator applies the softmax function to every row of its input matrix so that every row sums to one.
Note that as the queries, keys, and values are all from the
same words in the sequence, this attention is referred to as
the “self-attention” (Cheng et al., 2016).
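The following is a minimal NumPy sketch of Eqs. (13)-(16); the projection matrices are random placeholders standing in for learn-able parameters, and the softmax is applied row-wise as stated above.

import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))   # numerically stable
    return E / E.sum(axis=1, keepdims=True)        # every row sums to one

def self_attention(X, W_Q, W_K, W_V):
    # X: d x n matrix of word embeddings (columns are words)
    Q = W_Q.T @ X                                  # Eq. (13), p x n
    K = W_K.T @ X                                  # Eq. (14), p x n
    V = W_V.T @ X                                  # Eq. (15), r x n
    p = Q.shape[0]
    A = softmax_rows((Q.T @ K) / np.sqrt(p))       # n x n attention weights
    return V @ A                                   # Eq. (16), r x n

rng = np.random.default_rng(0)
d, p, r, n = 6, 4, 5, 3
X = rng.normal(size=(d, n))
Z = self_attention(X, rng.normal(size=(d, p)), rng.normal(size=(d, p)), rng.normal(size=(d, r)))
print(Z.shape)   # (5, 3)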
2.5. Attention in Other Fields Such as Vision and
Speech
Note that the concept of attention can be used in any field of
research and not merely in NLP. The attention concept has
widely been used in NLP (Chorowski et al., 2014; Luong
et al., 2015). Attention can be used in the field of com-
puter vision (Xu et al., 2015). Attention in computer vision
means attending to specific parts of the image which are more important and informative (see Fig. 1). This simulates attention and expectation in the human visual system (Summerfield & Egner, 2009), where our brain filters the observed scene to focus on its important parts.
For example, we can generate captions for an input image
using an autoencoder, illustrated in Fig. 8-a, with the at-
tention mechanism. As this figure shows, the encoder is a
Figure 7. The low-level and high-level features learned in the low
and high convolutional layers, respectively. The credit of this im-
age is for (Lee et al., 2009).
Convolutional Neural Network (CNN) (LeCun et al., 1998)
for extracting visual features and the decoder consists of
LSTM (Hochreiter & Schmidhuber, 1997) or RNN (Kom-
brink et al., 2011) modules for generating the caption text.
Literature has shown that the lower convolutional layers in
CNN capture low-level features and different partitions of
input images (Lee et al., 2009). Figure 7 shows an exam-
ple of extracted features by CNN layers trained by facial
images. Different facial organs and features have been ex-
tracted in lower layers of CNN. Therefore, as Fig. 8-b
shows, we can consider the extracted low-layer features,
which are different parts of the image, as the hidden vectors $\{h_j\}_{j=1}^{n'}$ in Eq. (4), where $n'$ is the number of features for extracted image partitions. Similarity with latent vectors
of decoder (LSTM) is computed by Eq. (4) and the query-
retrieval model of attention mechanism, introduced before,
is used to learn a self-attention on the images. Note that, as the partitions of the image are considered to be the hidden variables used for attention, the model attends to important parts of the input image; e.g., see Fig. 1.
Note that using attention in different fields of science is
usually referred to as “attend, tell, and do something...”.
Some examples of applications of attention are caption
generation for images (Xu et al., 2015), caption gener-
ation for images with ownership protection (Lim et al.,
2020), text reading from images containing a text (Li et al.,
2019a), translation of one image to another related image
(Zhang et al., 2018; Yang et al., 2019a), visual question
answering (Kazemi & Elqursh, 2017), human-robot social
interaction (Qureshi et al., 2017), and speech recognition
Figure 8. Using attention in computer vision: (a) transformation of an image to a caption with a CNN for the encoder and an RNN or LSTM for the decoder. The caption for this image is “Elephant is in water”; (b) the convolutional filters take the values from the image for the query-retrieval modeling in the attention mechanism.
(Chan et al., 2015; 2016).
3. Transformers
3.1. The Concept of Transformation
As was explained in Section 2.1, we can have an autoen-
coder which takes an input, embeds data to a context vec-
tor which simulates a concept in human’s mind (Perlovsky,
2006), and generates an output. The input and output are
related to each other through the context vector. In other
words, the autoencoder transforms the input to a related
output. An example of transformation is translating a sentence from one language to the same sentence in another language. Another example of transformation is image captioning, in which the image is transformed to a caption explaining the content of the image. A pure computer vision example of transformation is converting a day-time input image to the same image at night time. An autoencoder, named the “transformer”, is proposed in the literature for the task of transformation (Vaswani et al.,
2017). The structure of transformer is depicted in Fig. 9.
As this figure shows, a transformer is an autoencoder con-
sisting of an encoder and a decoder. In the following, we
explain the details of encoder and decoder of a transformer.
3.2. Encoder of Transformer
The encoder part of transformer, illustrated in Fig. 9, em-
beds the input sequence of n words $X \in \mathbb{R}^{d \times n}$ into context
vectors with the attention mechanism. Different parts of the
encoder are explained in the following.
3.2.1. POSITIONAL ENCODING
We will explain in Section 3.4 that the transformer in-
troduced here does not have any recurrence and RNN or
LSTM module. As there is no recurrence and no convolu-
tion, the model has no sense of order in sequence. As the
order of words is important for meaning of sentences, we
need a way to account for the order of tokens or words in
the sequence. For this, we can add a vector accounting for
the position to each input word embedding.
Consider the embedding of the i-th word in the sequence, denoted by $x_i \in \mathbb{R}^d$. For encoding the position of the i-th word in the sequence, the position vector $p_i \in \mathbb{R}^d$ can be set as:
$p_i(2j+1) := \cos\Big(\frac{i}{10000^{2j/p}}\Big)$, $p_i(2j) := \sin\Big(\frac{i}{10000^{2j/p}}\Big)$, (17)
for all $j \in \{0, 1, \ldots, \lfloor d/2 \rfloor\}$, where $p_i(2j+1)$ and $p_i(2j)$
denote the odd and even elements of pi, respectively. Fig-
ure 10 illustrates the dimensions of the position vectors
across different positions. As can be seen in this figure,
the position vectors for different positions of words are dif-
ferent as expected. Moreover, this figure shows that the differences of the position vectors concentrate more on the initial dimensions of the vectors. As Fig. 9 shows, for incorporating
the information of position with data, we add the positional
encoding to the input embedding:
$x_i \leftarrow x_i + p_i$. (18)
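The following is a small Python sketch of Eqs. (17) and (18), using the exponent 2j/p as written above; the scaling constant p = 100 follows the example of Fig. 10 and is an illustrative choice.

import numpy as np

def positional_encoding(n, d, p=100.0):
    # P[i] is the position vector p_i of Eq. (17) for the i-th word
    P = np.zeros((n, d))
    for i in range(n):
        for j in range(d // 2 + 1):
            angle = i / (10000 ** (2 * j / p))
            if 2 * j < d:
                P[i, 2 * j] = np.sin(angle)        # even elements
            if 2 * j + 1 < d:
                P[i, 2 * j + 1] = np.cos(angle)    # odd elements
    return P

n, d = 10, 60
X = np.random.default_rng(0).normal(size=(n, d))    # word embeddings x_i
X = X + positional_encoding(n, d)                    # Eq. (18): x_i <- x_i + p_i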
Figure 9. The encoder and decoder parts of a transformer. The
credit of this image is for (Vaswani et al., 2017).
3.2.2. MULTIHEAD ATTENTION WITH
SELF-ATTENTION
After positional encoding, data are fed to a multihead atten-
tion module with self-attention. The multihead attention is
illustrated in Fig. 11. This module applies the attention mechanism h times. The reason for these several repeats of attention is as follows. The first attention determines how much every word attends to other words. The second repeat of attention calculates how much every pair of words attends to other pairs of words. Likewise, the third repeat of attention sees how much every pair of pairs of words attends to other pairs of pairs of words; and so on. Note that
this measure of attention or similarity between hierarchical
pairs of words reminds us of the maximum mean discrep-
ancy (Gretton et al., 2007; 2012) which measures similarity
between different moments of data distributions.
Figure 10. The vectors of positional encoding. In this example, it is assumed that n = 10 (number of positions of words), d = 60, and p = 100.
Figure 11. Multihead attention with h heads.
As Fig. 11 shows, the data, which include positional encoding, are passed through linear layers for obtaining the queries, values, and keys. These linear layers model the linear projections introduced in Eqs. (13), (15), and (14). We have h of these linear layers to generate h sets of queries, values, and keys as:
$\mathbb{R}^{p \times n} \ni Q_i = W_{Q,i} X, \quad \forall i \in \{1, \ldots, h\}$, (19)
$\mathbb{R}^{r \times n} \ni V_i = W_{V,i} X, \quad \forall i \in \{1, \ldots, h\}$, (20)
$\mathbb{R}^{p \times n} \ni K_i = W_{K,i} X, \quad \forall i \in \{1, \ldots, h\}$, (21)
Then, the scaled dot product similarity, defined in Eq. (10), is used to generate the h attention values $\{Z_i\}_{i=1}^h$ defined by Eq. (16). These h attention values are concatenated to make a new long flattened vector. Then, by a linear layer, which is a linear projection, the total attention value, $Z_t$, is obtained:
$Z_t := W_O\, \text{concat}(Z_1, Z_2, \ldots, Z_h)$. (22)
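A minimal NumPy sketch of Eqs. (19)-(22), reusing the attention of Eq. (16) for each head, is given below; the per-head matrices and the output projection $W_O$ (assumed here to map the h concatenated heads back to d dimensions) are random placeholders.

import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def head(X, W_Q, W_K, W_V):
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X               # Eqs. (19)-(21)
    p = Q.shape[0]
    return V @ softmax_rows((Q.T @ K) / np.sqrt(p))   # Eq. (16), r x n

def multihead_attention(X, heads, W_O):
    Zs = [head(X, W_Q, W_K, W_V) for (W_Q, W_K, W_V) in heads]
    return W_O @ np.concatenate(Zs, axis=0)           # Eq. (22): project the concatenation

rng = np.random.default_rng(0)
d, p, r, n, h = 6, 4, 5, 3, 2
X = rng.normal(size=(d, n))
heads = [(rng.normal(size=(p, d)), rng.normal(size=(p, d)), rng.normal(size=(r, d))) for _ in range(h)]
W_O = rng.normal(size=(d, h * r))                     # maps the h concatenated heads back to d dims
Z_t = multihead_attention(X, heads, W_O)
print(Z_t.shape)   # (6, 3)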
3.2.3. LAYER NORMALIZATION
As Fig. 9 shows, the data (containing positional encoding)
and the total attention value are added:
$Z'_t \leftarrow Z_t + X$. (23)
This addition is inspired by the concept of residual intro-
duced by ResNet (He et al., 2016). After this addition, a
layer normalization is applied where for each hidden unit
$h_i$ we have:
$h_i \leftarrow \frac{g}{\sigma} (h_i - \mu)$, (24)
where µ and σ are the empirical mean and standard devia-
tion over H hidden units:
$\mu := \frac{1}{H} \sum_{i=1}^{H} h_i$, (25)
$\sigma := \sqrt{\frac{1}{H} \sum_{i=1}^{H} (h_i - \mu)^2}$. (26)
This is a standardization which makes the mean zero and
the variance one; it is closely related to batch normalization
and reduces the covariate shift (Ioffe & Szegedy, 2015).
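A small sketch of Eqs. (24)-(26) follows; the gain g is assumed to be a scalar set to one, and a small epsilon is added for numerical stability, which the equations above omit.

import numpy as np

def layer_norm(h, g=1.0, eps=1e-6):
    # h: vector of H hidden units
    mu = h.mean()                                    # Eq. (25)
    sigma = np.sqrt(((h - mu) ** 2).mean() + eps)    # Eq. (26)
    return (g / sigma) * (h - mu)                    # Eq. (24)

h = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(h))    # approximately zero mean and unit variance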
3.2.4. FEEDFORWARD LAYER
Henceforth, let $Z'_t$ denote the total attention after both addition and layer normalization. We feed $Z'_t$ to a feedforward network, having nonlinear activation functions, and then, like before, we add the input of the feedforward network to its output:
$Z''_t \leftarrow R + Z'_t$, (27)
where $R$ denotes the output of the feedforward network.
Again, layer normalization is applied and we, henceforth,
denote the output of the encoder by $Z''_t$. This is the encoding
for the whole input sequence or sentence having the infor-
mation of attention of words and hierarchical pairs of words
on each other.
3.2.5. STACKING
As Fig. 9 shows, the encoder is a stack of N identical lay-
ers. This stacking is for having more learn-able parameters
to have enough degrees of freedom to learn the whole dic-
tionary of words. Through experiments, a good number of
stacks is found to be N = 6 (Vaswani et al., 2017).
3.3. Decoder of Transformer
The decoder part of transformer is shown in Fig. 9. In the
following, we explain the different parts of decoder.
3.3.1. MASKED MULTIHEAD ATTENTION WITH
SELF-ATTENTION
A part of the decoder is the masked multihead attention module whose input is the output embeddings $\{y_i\}_{i=1}^n$ shifted one word to the right. Positional encoding is also added to the
output embeddings for including the information of their
positions. For this, we use Eq. (18) where xi is replaced
by yi.
The output embeddings added with the positional encod-
ings are fed to the masked multihead attention module.
This module is similar to the multihead attention mod-
ule but masks away the forthcoming words after a word.
Therefore, every output word only attends to its previous output words, every pair of output words attends to its previous pairs of output words, every pair of pairs of output words attends to its previous pairs of pairs of output words,
and so on. The reason for using the masked version of
multihead attention for the output embeddings is that when
we are generating the output text, we do not have the next
words yet because the next words are not generated yet. It
is noteworthy that this masking imposes some idea of spar-
sity which was also introduced by the dropout technique
(Srivastava et al., 2014) but in a stochastic manner.
Recall Eq. (16) which was used for multihead attention
(see Section 3.2.2). The masked multihead attention is de-
fined as:
$\mathbb{R}^{r \times n} \ni Z_m := \text{maskedAttention}(Q, K, V) = V\, \text{softmax}\Big(\frac{1}{\sqrt{p}} Q^\top K + M\Big)$, (28)
where the mask matrix $M \in \mathbb{R}^{n \times n}$ is:
$M(i, j) := \begin{cases} 0 & \text{if } j \leq i, \\ -\infty & \text{if } j > i. \end{cases}$ (29)
As the softmax function has an exponential operator, the mask does not have any impact for $j \leq i$ (because it is multiplied by $e^0 = 1$) and masks away for $j > i$ (because it is multiplied by $e^{-\infty} = 0$). Note that $j \leq i$ and $j > i$ correspond to the previous and next words, respectively, in terms of position in the sequence.
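A small NumPy sketch of the mask of Eq. (29) and its use in Eq. (28) is the following; the query, key, and value matrices are assumed to be computed as before.

import numpy as np

def causal_mask(n):
    # M(i, j) = 0 for j <= i and -inf for j > i, as in Eq. (29)
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def masked_attention(Q, K, V):
    p = Q.shape[0]
    S = (Q.T @ K) / np.sqrt(p) + causal_mask(Q.shape[1])   # Eq. (28)
    return V @ softmax_rows(S)

print(causal_mask(4))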
Similar to before, the output of masked multihead attention
is normalized and then is added to its input.
Table 1. Comparison of complexities between self-attention and recurrence (Vaswani et al., 2017).
Method | Complexity per layer | Sequential operations | Maximum path length
Self-Attention | O(n²p) | O(1) | O(1)
Recurrence | O(np²) | O(n) | O(n)
3.3.2. MULTIHEAD ATTENTION WITH
CROSS-ATTENTION
As Figs. 9 and 11 illustrate, the output of masked multihead
attention module is fed to a multihead attention module
with cross-attention. This module is not self-attention because its values, keys, and queries are not all from the same sequence; its values and keys come from the output of the encoder while its queries come from the output of the masked multihead attention module in the decoder. In other words,
the values and keys come from the processed input embed-
dings and the queries are from the processed output embed-
dings. The calculated multihead attention determines how
much every output embedding attends to the input embed-
dings, how much every pair of output embeddings attends
to the pairs of input embeddings, how much every pair of
pairs of output embeddings attends to the pairs of pairs of
input embeddings, and so on. This shows the connection
between input sequence and the generated output sequence.
3.3.3. FEEDFORWARD LAYER AND SOFTMAX
ACTIVATION
Again, the output of the multihead attention module with
cross-attention is normalized and added to its input. Then,
it is fed to a feedforward neural network with layer normalization and addition to its input afterwards. Note that the
masked multihead attention, the multihead attention with
cross-attention, and the feedforward network are stacked
for N = 6 times.
The output of feedforward network passes through a linear
layer by linear projection and a softmax activation function
is applied finally. The number of output neurons with the
softmax activation functions is the number of all words in
the dictionary, which is a large number. The outputs of the decoder sum to one and give the probability of every word in the dictionary being the next generated word. For the sake of sequence generation, the token or word with the largest probability is taken as the next word.
3.4. Attention is All We Need!
3.4.1. NO NEED FOR RNN!
As Fig. 9 illustrates, the output of decoder is fed to the
masked multihead attention module of decoder with some
shift. Note that this is not a notion of recurrence because
it can be interpreted by the procedure of teacher-forcing (Kolen & Kremer, 2001). Hence, we see that there is not any recurrent module like RNN (Kombrink et al., 2011) or LSTM (Hochreiter & Schmidhuber, 1997) in the transformer. We showed that we can learn a sequence using the transformer. Therefore, attention is all we need to learn a sequence and there is no need for any recurrence module.
The proposal of transformers (Vaswani et al., 2017) was a
breakthrough in NLP; the state-of-the-art NLP methods are
all based on transformers nowadays.
3.4.2. COMPLEXITY COMPARISON
Table 1 reports the complexity of operations in the self-
attention mechanism and compares them with those in re-
currence such as RNN. In self-attention, we learn attention
of every word to every other word in the sequence of n
words. Also, we learn a p-dimensional embedding for ev-
ery word. Hence, the complexity of operations per layer is $O(n^2 p)$, while the complexity per layer in recurrence is $O(np^2)$. Although the complexity per layer in self-attention is worse than in recurrence, many of its operations can be performed in parallel because all the words of the sequence are processed simultaneously, as also explained in the following. Hence, the $O(n^2 p)$ complexity is not very bad because it can be parallelized. That is while the recurrence cannot be parallelized because of its sequential nature.
As for the number of sequential operations, the self-
attention mechanism processes all the n words simultane-
ously, so its number of sequential operations is of order O(1). As
recurrence should process the words sequentially, the num-
ber of its sequential operations is of order O(n). As for
the maximum path length between every two words, self-
attention learns attention between every two words; hence,
its maximum path length is of the order O(1). However,
in recurrence, as every word requires a path with a length
of a fraction of sequence (a length of n in the worst case)
to reach the process of another word, its maximum path
length is O(n). This shows that attention reduces both se-
quential operations and maximum path length, compared
to recurrence.
4. BERT: Bidirectional Encoder
Representations from Transformers
BERT (Devlin et al., 2018) is one of the state-of-the-art
methods for NLP. It is a stack of encoders of transformer
(see Fig. 9). In other words, it is built using transformer
encoder blocks. Although some NLP methods, such as
XLNet (Yang et al., 2019b), have slightly outperformed it,
BERT is still one of the best models for different NLP tasks
such as question answering (Qu et al., 2019), natural language understanding (Dong et al., 2019), sentiment analysis, and language inference (Song et al., 2020).
Figure 12. Feeding sentences with missing words to the BERT model for training.
BERT uses the technique of masked language modeling. It
masks 15% of words in the input document/corpus and asks
the model to predict the missing words. As Fig. 12 depicts,
a sentence with a missing word is given to every trans-
former encoding block in the stack and the block is sup-
posed to predict the missing word. 15% of words are miss-
ing in the sentences and every missing word is assigned to
every encoder block in the stack. It is trained in an unsupervised manner because any word can be masked in a sentence and the output is supposed to be that word. As it is unsupervised and does not require labels, the huge amount of text data on the Internet can be used for training the BERT model, where words are randomly selected to be masked.
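As a rough illustration of this masking procedure, the following sketch randomly replaces 15% of the tokens of a sentence with a [MASK] token; the whitespace tokenization and the single [MASK] string are simplifying assumptions (BERT itself uses WordPiece tokens and a more involved replacement scheme).

import random

def mask_sentence(sentence, mask_rate=0.15, seed=0):
    random.seed(seed)
    tokens = sentence.split()                # simplistic whitespace tokenization
    targets = {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok                 # the word the model must predict
            tokens[i] = "[MASK]"
    return " ".join(tokens), targets

masked, targets = mask_sentence("the elephant is standing in the water near the trees")
print(masked)
print(targets)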
Note that as BERT learns to predict the missing word based on attention to its previous and forthcoming words, it is bidirectional. Hence, BERT jointly conditions on both the left (previous) and right (forthcoming) context of every word. Moreover, as the missing word is predicted based on the other words of the sentence, BERT embeddings for words are context-aware embeddings. Therefore, in contrast to word2vec (Mikolov et al., 2013a;b; Goldberg & Levy, 2014; Mikolov et al., 2015) and GloVe (Pennington
et al., 2014) which provide a single embedding per every
word, every word has different BERT embeddings in var-
ious sentences. The BERT embeddings of words differ in
different sentences based on their context. For example, the
word “bank” has different meanings and therefore different
embeddings in the sentences “Money is in the bank” and
“Some plants grow on the bank of rivers”.
It is also noteworthy that, for an input sentence, BERT out-
puts an embedding for the whole sentence in addition to
giving embeddings for every word of the sentence. This
sentence embedding is not perfect but works well enough
in applications. One can use the BERT sentence embeddings and train a classifier on them for the task of spam detection or sentiment analysis.
During training of the BERT model, in addition to learning the embeddings for the words and the whole sentence, the paper (Devlin et al., 2018) also learns an additional task. This task is: given two sentences A and B, is B likely to be the sentence that follows A or not?
The BERT model is usually not trained from scratch as its training takes a long time on a huge amount of Internet data. For using it in different NLP applications, such as sentiment analysis, researchers usually add one or several neural network layers on top of a pre-trained BERT model and train the network for their own task. During training, one can either lock the weights of the BERT model or fine-tune them by backpropagation.
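As an illustration, the following is a minimal PyTorch sketch, using the HuggingFace transformers package, of adding a classification layer on top of a pre-trained BERT whose weights are locked; the model name "bert-base-uncased" and the single linear head are illustrative choices, not the setup of any particular paper.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

for param in bert.parameters():          # lock the BERT weights; remove this loop to fine-tune them
    param.requires_grad = False

classifier = torch.nn.Linear(bert.config.hidden_size, 2)   # e.g., spam vs. not spam

inputs = tokenizer("Money is in the bank", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
sentence_embedding = outputs.pooler_output   # embedding of the whole sentence
logits = classifier(sentence_embedding)      # train this head with a standard classification loss
print(logits.shape)                          # torch.Size([1, 2])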
The parameters of encoder in transformer (Vaswani et al.,
2017) are 6 encoder layers, 512 hidden layer units in the
fully connected network and 8 attention heads (h = 8).
This is while BERT (Devlin et al., 2018) has 24 encoder
layers, 1024 hidden layer units in the fully connected net-
work and 16 attention heads (h = 16). Usually, when
we say BERT, we mean the large BERT (Devlin et al.,
2018) with the above mentioned parameters. As the BERT
model is huge and requires a lot of memory for saving
the model, it cannot easily be used in embedded systems.
Hence, many commercial smaller versions of BERT are proposed with fewer parameters and fewer stacks. Some of these smaller versions of BERT are small BERT (Tsai et al., 2019), tiny BERT (Jiao et al., 2019), DistilBERT (Sanh et al., 2019), and RoBERTa (Staliūnaitė & Iacobacci, 2020). Some BERT models, such as clinical
BERT (Alsentzer et al., 2019) and BioBERT (Lee et al.,
2020), have also been trained on medical texts for the
biomedical applications.
5. GPT: Generative Pre-trained Transformer
GPT, or GPT-1, (Radford et al., 2018) is another state-of-
the-art method for NLP. It is a stack of decoders of trans-
former (see Fig. 9). In other words, it is built using trans-
former decoder blocks. In GPT, the multihead attention
module with cross-attention is removed from the decoder of
transformer because there is no encoder in GPT. Hence, the
decoder blocks used in GPT have only positional encoding,
masked multihead self-attention module and feedforward
network with their adding, layer normalization, and activa-
tion functions.
Note that as GPT uses the masked multihead self-attention,
it considers attention of words, pairs of words, pairs of pairs
of words, and so on, only on the previous (left) words, pairs
of words, pairs of pairs of words, and so on. In other words,
GPT only conditions on the previous words and not the
forthcoming words. As was explained before, the objec-
tive of BERT was to predict a masked word in a sentence.
11. Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey 11
However, the GPT model is used for language modeling (Rosenfeld, 2000; Jozefowicz et al., 2016; Jing & Xu, 2019), whose objective is to predict the next word in an incomplete sentence given all of the previous words. The predicted new word is then added to the sequence and fed to GPT as input again, and the next word is predicted. This goes on until the sentences are completed with their coming words. In other words, the GPT model takes some docu-
ment and continues the text in the best and related way. For
example, if the input sentences are about psychology, the
trained GPT model generates the next sentences also about
psychology to complete the document. Note that as any text
without labels can be used for predicting the next words in sentences, GPT is an unsupervised method, making it possible to train it on a huge amount of Internet data.
The successors of GPT-1 (Radford et al., 2018) are GPT-
2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020).
GPT-2 and GPT-3 are extensions of GPT-1 with more stacks of the transformer decoder. Hence, they have
more learn-able parameters and can be trained with more
data for better language modeling and inference. For ex-
ample, GPT-2 has 1.5 billion parameters. GPT-2 and es-
pecially GPT-3 have been trained with much more Internet
data with various general and academic subjects to be able
to generate text in any subject and style of interest. For
example, GPT-2 has been trained on 8 million web pages
which contain 40GB of Internet text data.
GPT-2 is a quite large model and cannot be easily used in embedded systems because it requires a large amount of memory. Hence, different sizes of GPT-2, like small, medium, large, XLarge, and DistilGPT-2, are provided for usage in embedded systems, where the number of stacks and learn-able parameters differ in these versions. These versions of GPT-2 can be found and used in the HuggingFace transformers Python package (Wolf et al., 2019a).
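As an illustration, the following is a minimal sketch of loading GPT-2 from the HuggingFace transformers package and letting it continue a prompt; the prompt and the greedy generate() call with default settings are illustrative choices.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")        # or "distilgpt2", "gpt2-medium", ...
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Attention mechanisms in psychology suggest that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Repeatedly predict the next token and append it to the sequence
output_ids = model.generate(input_ids, max_length=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))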
GPT-2 has been used in many different applications such
as dialogue systems (Budzianowski & Vulić, 2019), patent claim generation (Lee & Hsiang, 2019), and medical text simplification (Van et al., 2020). A combination of GPT-2 and BERT has been used for question answering (Klein & Nabi, 2019). It is noteworthy that GPT can be seen as few-
shot learning (Brown et al., 2020). A comparison of GPT
and BERT can also be found in (Ethayarajh, 2019).
GPT-3 is a very huge version of GPT with a large number of stacks and learn-able parameters. For comparison, note
that GPT-2, NVIDIA Megatron (Shoeybi et al., 2019), Mi-
crosoft Turing-NLG (Microsoft, 2020), and GPT-3 (Brown
et al., 2020) have 1.5 billion, 8 billion, 17 billion, and 175
billion learn-able parameters, respectively. This huge number of parameters allows GPT-3 to be trained with a very huge amount of Internet text data on various subjects and topics. Hence, GPT-3 has been able to learn almost all topics of documents, and some people are even discussing whether it can pass the writer's Turing test (Elkins & Chun, 2020; Floridi & Chiriatti, 2020). Note that GPT-3 has kind of memorized the texts of all subjects, but not in a bad way, i.e., overfitting, but in a good way. This memorization is because of the complexity of the huge number of learn-able parameters (Arpit et al., 2017), and not being overfitted is because of being trained on big enough Internet data.
GPT-3 has had many different interesting applications such
as fiction and poetry generation (Branwen, 2020). Of
course, it is causing some risks, too (McGuffie & Newhouse, 2020).
6. Conclusion
Transformers are essential tools in natural language
processing and computer vision. This paper was a tuto-
rial and survey paper on attention mechanism, transform-
ers, BERT, and GPT. We explained attention mechanism,
the sequence-to-sequence model with and without atten-
tion, and self-attention. The different parts of encoder and
decoder of a transformer were explained. Finally, BERT
and GPT were introduced as stacks of the encoders and de-
coders of transformer, respectively.
Acknowledgment
The authors hugely thank Prof. Pascal Poupart whose
course partly covered some of the materials in this tutorial pa-
per. Some of the materials of this paper can also be found
in Prof. Ali Ghodsi’s course videos.
References
Alsentzer, Emily, Murphy, John R, Boag, Willie, Weng,
Wei-Hung, Jin, Di, Naumann, Tristan, and McDermott,
Matthew. Publicly available clinical BERT embeddings.
arXiv preprint arXiv:1904.03323, 2019.
Arpit, Devansh, Jastrzebski, Stanisław, Ballas, Nicolas,
Krueger, David, Bengio, Emmanuel, Kanwal, Maxin-
der S, Maharaj, Tegan, Fischer, Asja, Courville, Aaron,
Bengio, Yoshua, and Lacoste-Julien, Simon. A closer
look at memorization in deep networks. In International
Conference on Machine Learning, 2017.
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio,
Yoshua. Neural machine translation by jointly learning
to align and translate. In International Conference on
Learning Representations, 2015.
Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza,
Alex, Pereira, Fernando, and Vaughan, Jennifer Wort-
man. A theory of learning from different domains. Ma-
chine learning, 79(1-2):151–175, 2010.
Branwen, Gwern. GPT-3 creative fiction. https://
www.gwern.net/GPT-3, 2020.
Brown, Tom B, Mann, Benjamin, Ryder, Nick, Subbiah,
Melanie, Kaplan, Jared, Dhariwal, Prafulla, Neelakan-
tan, Arvind, Shyam, Pranav, Sastry, Girish, Askell,
Amanda, et al. Language models are few-shot learners.
In Advances in neural information processing systems,
2020.
Budzianowski, Paweł and Vulić, Ivan. Hello, it’s GPT-2–
how can I help you? Towards the use of pretrained lan-
guage models for task-oriented dialogue systems. arXiv
preprint arXiv:1907.05774, 2019.
Chan, William, Jaitly, Navdeep, Le, Quoc V, and Vinyals,
Oriol. Listen, attend and spell. arXiv preprint
arXiv:1508.01211, 2015.
Chan, William, Jaitly, Navdeep, Le, Quoc, and Vinyals,
Oriol. Listen, attend and spell: A neural network
for large vocabulary conversational speech recognition.
In 2016 IEEE International Conference on Acoustics,
Speech and Signal Processing, pp. 4960–4964. IEEE,
2016.
Cheng, Jianpeng, Dong, Li, and Lapata, Mirella. Long
short-term memory-networks for machine reading. In
Conference on Empirical Methods in Natural Language
Processing, pp. 551–561, 2016.
Chorowski, Jan, Bahdanau, Dzmitry, Cho, Kyunghyun, and
Bengio, Yoshua. End-to-end continuous speech recog-
nition using attention-based recurrent NN: First results.
arXiv preprint arXiv:1412.1602, 2014.
Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and
Toutanova, Kristina. BERT: Pre-training of deep bidi-
rectional transformers for language understanding. arXiv
preprint arXiv:1810.04805, 2018.
Dong, Li, Yang, Nan, Wang, Wenhui, Wei, Furu, Liu, Xi-
aodong, Wang, Yu, Gao, Jianfeng, Zhou, Ming, and Hon,
Hsiao-Wuen. Unified language model pre-training for
natural language understanding and generation. In Ad-
vances in Neural Information Processing Systems, pp.
13063–13075, 2019.
Dou, Qi, Coelho de Castro, Daniel, Kamnitsas, Konstanti-
nos, and Glocker, Ben. Domain generalization via
model-agnostic learning of semantic features. Advances
in Neural Information Processing Systems, 32:6450–
6461, 2019.
Elkins, Katherine and Chun, Jon. Can GPT-3 pass a writer’s
Turing test? Journal of Cultural Analytics, 2371:4549,
2020.
Ethayarajh, Kawin. How contextual are contextualized
word representations? Comparing the geometry of
BERT, ELMo, and GPT-2 embeddings. arXiv preprint
arXiv:1909.00512, 2019.
Floridi, Luciano and Chiriatti, Massimo. GPT-3: Its nature,
scope, limits, and consequences. Minds and Machines,
pp. 1–14, 2020.
Garcia-Molina, Hector, D. Ullman, Jeffrey, and Widom,
Jennifer. Database systems: The complete book. Pren-
tice Hall, 1999.
Goldberg, Yoav and Levy, Omer. word2vec explained:
deriving Mikolov et al.’s negative-sampling word-
embedding method. arXiv preprint arXiv:1402.3722,
2014.
Gretton, Arthur, Borgwardt, Karsten, Rasch, Malte,
Schölkopf, Bernhard, and Smola, Alex J. A kernel
method for the two-sample-problem. In Advances in
neural information processing systems, pp. 513–520,
2007.
Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J,
Schölkopf, Bernhard, and Smola, Alexander. A kernel
two-sample test. The Journal of Machine Learning Re-
search, 13(1):723–773, 2012.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun,
Jian. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 770–778, 2016.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-
term memory. Neural computation, 9(8):1735–1780,
1997.
Indurkhya, Nitin and Damerau, Fred J. Handbook of natu-
ral language processing, volume 2. CRC Press, 2010.
Ioffe, Sergey and Szegedy, Christian. Batch normalization:
Accelerating deep network training by reducing internal
covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Jiao, Xiaoqi, Yin, Yichun, Shang, Lifeng, Jiang, Xin, Chen,
Xiao, Li, Linlin, Wang, Fang, and Liu, Qun. Tiny-
BERT: Distilling BERT for natural language understand-
ing. arXiv preprint arXiv:1909.10351, 2019.
Jing, Kun and Xu, Jungang. A survey on neural network
language models. arXiv preprint arXiv:1906.03591,
2019.
Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer,
Noam, and Wu, Yonghui. Exploring the limits of
language modeling. arXiv preprint arXiv:1602.02410,
2016.
Kazemi, Vahid and Elqursh, Ali. Show, ask, attend, and
answer: A strong baseline for visual question answering.
arXiv preprint arXiv:1704.03162, 2017.
Klein, Tassilo and Nabi, Moin. Learning to answer by
learning to ask: Getting the best of gpt-2 and bert worlds.
arXiv preprint arXiv:1911.02365, 2019.
Kolen, John F and Kremer, Stefan C. A field guide to dy-
namical recurrent networks. John Wiley & Sons, 2001.
Kombrink, Stefan, Mikolov, Tomáš, Karafiát, Martin, and
Burget, Lukáš. Recurrent neural network based language
modeling in meeting recognition. In Twelfth annual con-
ference of the international speech communication asso-
ciation, 2011.
LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner,
Patrick. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng,
Andrew Y. Convolutional deep belief networks for scal-
able unsupervised learning of hierarchical representa-
tions. In International Conference on Machine Learning,
pp. 609–616, 2009.
Lee, Jieh-Sheng and Hsiang, Jieh. Patent claim gener-
ation by fine-tuning OpenAI GPT-2. arXiv preprint
arXiv:1907.02052, 2019.
Lee, Jinhyuk, Yoon, Wonjin, Kim, Sungdong, Kim,
Donghyeon, Kim, Sunkyu, So, Chan Ho, and Kang, Jae-
woo. BioBERT: a pre-trained biomedical language rep-
resentation model for biomedical text mining. Bioinfor-
matics, 36(4):1234–1240, 2020.
Li, Da, Yang, Yongxin, Song, Yi-Zhe, and Hospedales,
Timothy M. Deeper, broader and artier domain gener-
alization. In Proceedings of the IEEE international con-
ference on computer vision, pp. 5542–5550, 2017.
Li, Hui, Wang, Peng, Shen, Chunhua, and Zhang, Guyu.
Show, attend and read: A simple and strong baseline
for irregular text recognition. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 33,
pp. 8610–8617, 2019a.
Li, Yang, Kaiser, Lukasz, Bengio, Samy, and Si, Si.
Area attention. In International Conference on Machine
Learning, pp. 3846–3855. PMLR, 2019b.
Lim, Jian Han, Chan, Chee Seng, Ng, Kam Woh, Fan,
Lixin, and Yang, Qiang. Protect, show, attend and
tell: Image captioning model with ownership protection.
arXiv preprint arXiv:2008.11009, 2020.
Luong, Minh-Thang, Pham, Hieu, and Manning, Christo-
pher D. Effective approaches to attention-based neural
machine translation. arXiv preprint arXiv:1508.04025,
2015.
McGuffie, Kris and Newhouse, Alex. The radicalization
risks of GPT-3 and advanced neural language models.
arXiv preprint arXiv:2009.06807, 2020.
Microsoft. Turing-NLG: A 17-billion-parameter language
model by Microsoft. Microsoft Blog, 2020.
Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jef-
frey. Efficient estimation of word representations in vec-
tor space. In International Conference on Learning Rep-
resentations, 2013a.
Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado,
Greg S, and Dean, Jeff. Distributed representations of
words and phrases and their compositionality. In Ad-
vances in neural information processing systems, pp.
3111–3119, 2013b.
Mikolov, Tomas, Chen, Kai, Corrado, Gregory S, and
Dean, Jeffrey A. Computing numeric representations of
words in a high-dimensional space, May 19 2015. US
Patent 9,037,464.
Pennington, Jeffrey, Socher, Richard, and Manning,
Christopher D. GloVe: Global vectors for word rep-
resentation. In Proceedings of the 2014 conference on
empirical methods in natural language processing, pp.
1532–1543, 2014.
Perlovsky, Leonid I. Toward physics of the mind: Con-
cepts, emotions, consciousness, and symbols. Physics of
Life Reviews, 3(1):23–55, 2006.
Qu, Chen, Yang, Liu, Qiu, Minghui, Croft, W Bruce,
Zhang, Yongfeng, and Iyyer, Mohit. BERT with his-
tory answer embedding for conversational question an-
swering. In Proceedings of the 42nd International ACM
SIGIR Conference on Research and Development in In-
formation Retrieval, pp. 1133–1136, 2019.
Qureshi, Ahmed Hussain, Nakamura, Yutaka, Yoshikawa,
Yuichiro, and Ishiguro, Hiroshi. Show, attend and inter-
act: Perceivable human-robot social interaction through
neural attention q-network. In 2017 IEEE International
Conference on Robotics and Automation, pp. 1639–
1645. IEEE, 2017.
Radford, Alec, Narasimhan, Karthik, Salimans, Tim, and
Sutskever, Ilya. Improving language understanding by
generative pre-training. Technical report, OpenAI, 2018.
Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David,
Amodei, Dario, and Sutskever, Ilya. Language models
are unsupervised multitask learners. OpenAI blog, 1(8):
9, 2019.
Rosenfeld, Ronald. Two decades of statistical language
modeling: Where do we go from here? Proceedings
of the IEEE, 88(8):1270–1278, 2000.
Rumelhart, David E, Hinton, Geoffrey E, and Williams,
Ronald J. Learning representations by back-propagating
errors. Nature, 323(6088):533–536, 1986.
Sanh, Victor, Debut, Lysandre, Chaumond, Julien, and
Wolf, Thomas. DistilBERT, a distilled version of bert:
smaller, faster, cheaper and lighter. arXiv preprint
arXiv:1910.01108, 2019.
Shoeybi, Mohammad, Patwary, Mostofa, Puri, Raul,
LeGresley, Patrick, Casper, Jared, and Catanzaro, Bryan.
Megatron-LM: Training multi-billion parameter lan-
guage models using model parallelism. arXiv preprint
arXiv:1909.08053, 2019.
Socher, Richard, Bengio, Yoshua, and Manning, Chris.
Deep learning for nlp. Tutorial at Association of Com-
putational Logistics (ACL), 2012.
Song, Youwei, Wang, Jiahai, Liang, Zhiwei, Liu, Zhiyue,
and Jiang, Tao. Utilizing BERT intermediate layers for
aspect based sentiment analysis and natural language in-
ference. arXiv preprint arXiv:2002.04815, 2020.
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex,
Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a
simple way to prevent neural networks from overfitting.
The journal of machine learning research, 15(1):1929–
1958, 2014.
Staliūnaitė, Ieva and Iacobacci, Ignacio. Compositional
and lexical semantics in RoBERTa, BERT and Dis-
tilBERT: A case study on CoQA. arXiv preprint
arXiv:2009.08257, 2020.
Summerfield, Christopher and Egner, Tobias. Expectation
(and attention) in visual cognition. Trends in cognitive
sciences, 13(9):403–409, 2009.
Tsai, Henry, Riesa, Jason, Johnson, Melvin, Arivazhagan,
Naveen, Li, Xin, and Archer, Amelia. Small and practi-
cal BERT models for sequence labeling. arXiv preprint
arXiv:1909.00100, 2019.
Van, Hoang, Kauchak, David, and Leroy, Gondy. Au-
toMeTS: The autocomplete for medical text simplifica-
tion. arXiv preprint arXiv:2010.10573, 2020.
Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit,
Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz,
and Polosukhin, Illia. Attention is all you need. In
Advances in neural information processing systems, pp.
5998–6008, 2017.
Wang, Yong, Wang, Longyue, Shi, Shuming, Li, Vic-
tor OK, and Tu, Zhaopeng. Go from the general to the
particular: Multi-domain translation with domain trans-
formation networks. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 34, pp. 9233–
9241, 2020.
Wolf, Thomas, Debut, Lysandre, Sanh, Victor, Chaumond,
Julien, Delangue, Clement, Moi, Anthony, Cistac, Pier-
ric, Rault, Tim, Louf, Rémi, Funtowicz, Morgan, et al.
HuggingFace’s transformers: State-of-the-art natural
language processing. arXiv preprint arXiv:1910.03771,
2019a.
Wolf, Thomas, Sanh, Victor, Chaumond, Julien, and De-
langue, Clement. TransferTransfo: A transfer learn-
ing approach for neural network based conversational
agents. In Proceedings of the AAAI Conference on Arti-
ficial Intelligence, 2019b.
Xu, Jun. On the techniques of English fast-reading. In
Theory and Practice in Language Studies, volume 1, pp.
1416–1419. Academy Publisher, 2011.
Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun,
Courville, Aaron, Salakhudinov, Ruslan, Zemel, Rich,
and Bengio, Yoshua. Show, attend and tell: Neural im-
age caption generation with visual attention. In Interna-
tional conference on machine learning, pp. 2048–2057,
2015.
Yang, Chao, Kim, Taehwan, Wang, Ruizhe, Peng, Hao, and
Kuo, C-C Jay. Show, attend, and translate: Unsupervised
image translation with self-regularization and attention.
IEEE Transactions on Image Processing, 28(10):4845–
4856, 2019a.
Yang, Zhilin, Dai, Zihang, Yang, Yiming, Carbonell,
Jaime, Salakhutdinov, Russ R, and Le, Quoc V. XL-
net: Generalized autoregressive pretraining for language
understanding. In Advances in neural information pro-
cessing systems, pp. 5753–5763, 2019b.
Yu, Keyi, Liu, Yang, Schwing, Alexander G, and Peng,
Jian. Fast and accurate text classification: Skimming,
rereading and early stopping. In International Confer-
ence on Learning Representations, 2018.
Zhang, Honglun, Chen, Wenqing, Tian, Jidong, Wang,
Yongkun, and Jin, Yaohui. Show, attend and translate:
Unpaired multi-domain image-to-image translation with
visual attention. arXiv preprint arXiv:1811.07483, 2018.