This document analyzes the evolution of advanced transformer-based language models for opinion mining tasks. It provides background on several transformer models, including BERT, GPT, ALBERT, RoBERTa, XLNet, DistilBERT, XLM-RoBERTa, BART, ConvBERT, Reformer, T5, ELECTRA, Longformer, and DeBERTa, and compares them in terms of architecture, pre-training data, training objectives, task performance, and computational cost. It aims to study the behavior of these cutting-edge models on opinion mining and to provide guidelines for researchers and engineers on model selection.
AN IMPROVED MT5 MODEL FOR CHINESE TEXT SUMMARY GENERATION (ijaia)
Complicated policy texts require a lot of effort to read, so there is a need for intelligent interpretation of Chinese policies. To better solve the Chinese text summarization task, this paper used the mT5 model as the core framework and initial weights. In addition, the paper reduced the model size through parameter clipping, used the Gap Sentence Generation (GSG) method as an unsupervised training method, and improved the Chinese tokenizer. After training on a meticulously processed 30 GB Chinese corpus, the paper developed the enhanced mT5-GSG model. When fine-tuning on Chinese policy texts, the paper adopted the idea of “Dropout Twice” and combined the probability distributions of the two dropout passes through the Wasserstein distance. Experimental results indicate that the proposed model achieved Rouge-1, Rouge-2, and Rouge-L scores of 56.13%, 45.76%, and 56.41% respectively on the Chinese policy text summarization dataset.
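The abstract does not spell out exactly how the two dropout distributions are combined, so the following is only a minimal sketch of the “Dropout Twice” idea applied R-Drop-style to a Hugging Face style seq2seq model; the model interface, the 1-D Wasserstein formulation over the vocabulary axis, and the weight alpha are all assumptions for illustration.

import torch
import torch.nn.functional as F

def wasserstein_1d(p, q):
    # 1-Wasserstein distance between discrete distributions on the same
    # ordered support: sum of absolute differences of the CDFs.
    return (p.cumsum(-1) - q.cumsum(-1)).abs().sum(-1).mean()

def dropout_twice_loss(model, input_ids, labels, alpha=1.0):
    model.train()                                       # keep dropout active
    out1 = model(input_ids=input_ids, labels=labels)    # first dropout pass
    out2 = model(input_ids=input_ids, labels=labels)    # second dropout pass
    p = F.softmax(out1.logits, dim=-1)
    q = F.softmax(out2.logits, dim=-1)
    ce = 0.5 * (out1.loss + out2.loss)                  # averaged cross-entropy
    return ce + alpha * wasserstein_1d(p, q)            # consistency term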
Crafting Your Customized Legal Mastery: A Guide to Building Your Private LLM (ChristopherTHyatt)
Create a tailored legal education with a private LLM. Identify your specialization, research courses from reputable institutions, and leverage online platforms for flexibility. Craft a unique curriculum combining law with interdisciplinary studies, enhancing your expertise. Network with professionals, balance theory with practical experience, and stay updated on legal trends. Build a personalized learning journey to unlock your full potential in the legal landscape.
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK (ijnlc)
In recent years, social media use has grown among people in Myanmar, and writing reviews about products, movies, and trips on social media pages has become popular. Most people look for review pages about a product before deciding whether to buy it. Extracting useful reviews about products of interest is important but time-consuming, and sentiment analysis is one of the key processes for extracting useful reviews. In this paper, a convolutional LSTM neural network architecture is proposed for sentiment classification of cosmetic reviews written in the Myanmar language. The paper also builds a cosmetic review dataset for deep learning and a sentiment lexicon in the Myanmar language.
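The paper's exact layer configuration is not given, so the following is only a minimal sketch of one common convolutional-LSTM text classifier: a 1-D convolution over word embeddings feeding an LSTM whose final state drives the sentiment output; vocabulary size, dimensions, and the tokenizer are placeholders.

import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=128, hidden=64, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, token_ids):                  # token_ids: (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)    # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        _, (h, _) = self.lstm(x)                   # final hidden state
        return self.fc(h[-1])                      # sentiment logits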
Performance Comparison between PyTorch and MindSpore (ijdms)
Deep learning is widely used in many fields. Training neural networks involves large amounts of data, and many deep learning frameworks have emerged to serve practitioners with more convenient and better-performing tooling. MindSpore and PyTorch are both deep learning frameworks: MindSpore is owned by HUAWEI, while PyTorch is owned by Facebook. Some people believe that HUAWEI's MindSpore outperforms Facebook's PyTorch, which leaves practitioners unsure which of the two to choose. In this paper, we perform analytical and experimental analysis to compare the training speed of MindSpore and PyTorch on a single GPU. To make the survey as comprehensive as possible, we carefully selected neural networks in two main domains, computer vision and natural language processing (NLP). The contribution of this work is twofold: first, we conduct detailed benchmarking experiments on MindSpore and PyTorch to analyze the reasons for their performance differences; second, this work provides guidance for end users choosing between the two frameworks.
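The kind of single-GPU training-speed measurement described above can be sketched as follows, here for PyTorch only and with a synthetic batch; the model, batch size, and step count are illustrative rather than the paper's actual benchmark setup, and a CUDA device is assumed.

import time
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet50().cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(32, 3, 224, 224, device="cuda")        # synthetic image batch
y = torch.randint(0, 1000, (32,), device="cuda")       # synthetic labels

torch.cuda.synchronize()
start = time.time()
for _ in range(100):                                   # 100 training steps
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
torch.cuda.synchronize()
print(f"throughput: {100 * 32 / (time.time() - start):.1f} images/s")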
This paper discusses the capabilities and limitations of GPT-3, a state-of-the-art language model, in the context of text understanding. We begin by describing the architecture and training process of GPT-3 and provide an overview of its impressive performance across a wide range of natural language processing tasks, such as language translation, question answering, and text completion. Throughout this research project, a summarizing tool was also created to help retrieve content from any type of document, specifically IELTS Reading Test data in this project. We also aimed to improve the accuracy of the summarizing, as well as the question-answering capabilities of GPT-3, via long text
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONAL RESOURCES (ijnlc)
With the recent developments in the field of Natural Language Processing, there has been a rise in the use of different architectures for Neural Machine Translation. Transformer architectures are used to achieve state-of-the-art accuracy, but they are very computationally expensive to train, and not everyone has access to setups with high-end GPUs and other resources. We train our models on low computational resources and investigate the results. As expected, transformers outperformed other architectures, but there were some surprising results: transformers with more encoders and decoders took more time to train yet obtained lower BLEU scores. LSTM performed well in the experiment and took comparatively less time to train than transformers, making it suitable for situations with time constraints.
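BLEU is the score the architectures are compared on; a minimal scoring example with the sacrebleu package is shown below (the hypothesis and reference sentences are invented, and the paper's own evaluation pipeline may differ).

import sacrebleu

hypotheses = ["the cat sat on the mat"]                 # system outputs
references = [["the cat is sitting on the mat"]]        # one reference stream
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")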
Foundation models are pre-trained models that can be fine-tuned on specific tasks or domains. These highly adaptable and high-performing models find applications across diverse domains, including Natural Language Processing (NLP), computer vision, and multimodal tasks.
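As a hedged illustration of what fine-tuning such a pre-trained model looks like in practice, the sketch below uses the Hugging Face Transformers Trainer on a small slice of a public sentiment dataset; the checkpoint, dataset, and hyperparameters are arbitrary choices, not a recommendation from the surrounding text.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)
data = load_dataset("imdb")
data = data.map(lambda b: tok(b["text"], truncation=True, padding="max_length"),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=data["train"].shuffle(seed=0).select(range(2000)),
)
trainer.train()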
Transformer models have taken over most natural language inference tasks, and in recent times they have beaten several benchmarks. Chunking means splitting sentences into tokens and then grouping them in a meaningful way. Chunking has gradually moved from POS-tag-based statistical models to neural networks using language models such as LSTMs, bidirectional LSTMs, attention models, etc. Deep neural network models are deployed indirectly to classify tokens into the tags defined for Named Entity Recognition tasks; later, these tags are used in conjunction with pointer frameworks for the final chunking task. In our paper, we propose an ensemble model that uses a fine-tuned transformer model and a recurrent neural network model together to predict tags and chunk substructures of a sentence. We analyzed the shortcomings of the transformer models in predicting different tags and then trained the BiLSTM+CNN accordingly to compensate.
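The abstract does not state how the two taggers' outputs are merged, so the following is only a minimal sketch of one plausible ensembling step: averaging per-token tag probabilities from the transformer and the BiLSTM+CNN before taking the argmax. The mixing weight and the assumption that both models share a tokenisation are illustrative.

import torch

def ensemble_tags(probs_transformer, probs_bilstm_cnn, weight=0.5):
    # probs_*: (seq_len, num_tags) softmax outputs aligned to the same tokens
    mixed = weight * probs_transformer + (1.0 - weight) * probs_bilstm_cnn
    return mixed.argmax(dim=-1)          # predicted tag id for each token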
A comprehensive guide to prompt engineering.pdf (StephenAmell4)
Prompt engineering is the practice of designing and refining specific text prompts to guide transformer-based language models, such as Large Language Models (LLMs), in generating desired outputs. It involves crafting clear and specific instructions and allowing the model sufficient time to process information. By carefully engineering prompts, practitioners can harness the capabilities of LLMs to achieve different goals.
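As a small illustration of the two practices mentioned (clear, specific instructions and giving the model room to work), here is one possible prompt; the wording and the review text are invented examples, not taken from the guide.

prompt = (
    "You are reviewing customer feedback.\n"
    "Task: classify the review delimited by <review> tags as positive, "
    "negative, or neutral, then list the two phrases that most influenced "
    "your decision.\n"
    "Work through the review step by step before giving your answer.\n"
    "<review>The battery lasts two days but the screen scratches easily.</review>"
)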
STOCKGRAM: DEEP LEARNING MODEL FOR DIGITIZING FINANCIAL COMMUNICATIONS VIA NATURAL LANGUAGE GENERATION (kevig)
This paper proposes a deep learning model, StockGram, to automate financial communications via natural language generation. StockGram is a seq2seq model that generates short and coherent versions of financial news reports based on the client's points of interest from numerous pools of verified sources. The proposed model is developed to mitigate the pain points of advisors who spend numerous hours scanning through these news reports manually. StockGram leverages bidirectional LSTM cells, which allow a recurrent system to make its predictions based on both past and future word sequences and hence predict the next word in the sequence more precisely. The proposed model utilizes custom word embeddings (GloVe), which incorporate global statistics to generate vector representations of news articles in an unsupervised manner and allow the model to converge faster. StockGram is evaluated based on the semantic closeness of the generated report to the provided prime words.
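One concrete piece of the pipeline described above is loading pre-trained GloVe vectors into an embedding matrix; the sketch below shows a common way to do that, with the file path, vocabulary, and dimensionality as placeholders rather than the paper's actual settings.

import numpy as np

def load_glove_matrix(glove_path, vocab, dim=100):
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.asarray(vals, dtype=np.float32)
    # words missing from GloVe keep a small random vector
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix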
Chunking means splitting sentences into tokens and then grouping them in a meaningful way. When it comes to high-performance chunking systems, transformer models have proved to be the state-of-the-art benchmarks. Performing chunking as a task requires a large-scale, high-quality annotated corpus where each token carries a particular tag, similar to Named Entity Recognition tasks; later these tags are used in conjunction with pointer frameworks to find the final chunk. Solving this for a specific domain becomes highly costly in terms of time and resources when a large, high-quality training set has to be annotated manually, and when the domain is specific and diverse, cold starting becomes even harder because of the large number of manually annotated queries needed to cover all aspects. To overcome the problem, we applied a grammar-based text generation mechanism where, instead of annotating sentences, we annotate grammar templates. We defined various templates corresponding to different grammar rules; to create a sentence we used these templates along with the rules, where symbol or terminal values were chosen from the domain data catalog. This let us create a large number of annotated queries, which were used to train an ensemble transformer-based deep neural network model [24]. We found that grammar-based annotation was useful for identifying domain-based chunks in input query sentences without any manual annotation, achieving a token-classification F1 score of 96.97% on out-of-template queries.
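A minimal sketch of the grammar-template idea follows: each template carries named slots, slot values are drawn from a domain catalog, and every generated sentence is annotated for free because the slot positions are known. The templates and catalog entries below are invented examples, not the paper's actual grammar.

import random

templates = [
    ("show {metric} for {product}", {"metric": "METRIC", "product": "PRODUCT"}),
    ("compare {product} with {other}", {"product": "PRODUCT", "other": "PRODUCT"}),
]
catalog = {"METRIC": ["revenue", "daily sales"], "PRODUCT": ["laptop x1", "phone z3"]}

def generate(n=10):
    samples = []
    for _ in range(n):
        text, slots = random.choice(templates)
        filled, tags = text, {}
        for slot, tag in slots.items():
            value = random.choice(catalog[tag])
            filled = filled.replace("{" + slot + "}", value, 1)
            tags[value] = tag                 # filled span -> chunk tag
        samples.append((filled, tags))
    return samples

print(generate(3))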
Convolutional neural network with binary moth flame optimization for emotion ... (IAESIJAI)
Electroencephalograph (EEG) signals can reflect brain activity in real time, and using the EEG signal to analyze human emotional states is a common line of study. EEG signals of emotions are not distinctive and differ from one person to another, as each person has different emotional responses to the same stimuli; this is why EEG signals are subject-dependent and have proven effective for subject-dependent emotion detection. To achieve enhanced accuracy and a high true-positive rate, the suggested system proposes a binary moth flame optimization (BMFO) algorithm for feature selection and convolutional neural networks (CNNs) for classification. In this proposal, optimal features are chosen using accuracy as the objective function, and the optimally chosen features are then classified with a CNN to discriminate different emotion states.
A novel ensemble model for detecting fake news (IAESIJAI)
Due to the growing proliferation of fake news over the past couple of years, our objective in this paper is to propose an ensemble model for the automatic classification of news articles as either real or fake. For this purpose, we opt for a blending technique that combines three models, namely a bidirectional long short-term memory (Bi-LSTM) network, a stochastic gradient descent classifier, and a ridge classifier. The implementation of the proposed model (i.e., BI-LSR) on real-world datasets has shown outstanding results, achieving an accuracy score of 99.16%. Accordingly, this ensemble has proven to perform better than individual conventional machine learning and deep learning models as well as many ensemble learning approaches cited in the literature.
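The blending step for the two linear members of the ensemble can be sketched with scikit-learn as below; the Bi-LSTM branch would contribute a third score column in the same way. The texts, labels, and meta-learner choice are synthetic placeholders, not the paper's data or exact recipe.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier, RidgeClassifier, LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["officials confirm the new policy", "shocking cure doctors hide"] * 50
labels = np.array([0, 1] * 50)                       # 0 = real, 1 = fake
X = TfidfVectorizer().fit_transform(texts)
X_tr, X_bl, y_tr, y_bl = train_test_split(X, labels, test_size=0.3, random_state=0)

scores = []
for clf in (SGDClassifier(), RidgeClassifier()):
    clf.fit(X_tr, y_tr)
    scores.append(clf.decision_function(X_bl))       # per-model blend-set scores
meta = LogisticRegression().fit(np.column_stack(scores), y_bl)
print(meta.predict(np.column_stack(scores)[:5]))     # blended predictions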
K-centroid convergence clustering identification in one-label per type for di... (IAESIJAI)
Disease prediction is a high-demand field that requires significant support from machine learning (ML) to enhance result efficiency. This research applies K-means-style clustering to supervised classification for disease prediction where each class has only one labelled sample. The K-centroid convergence clustering identification (KC3I) system is based on semi-K-means clustering but requires only a single labelled sample per class to start training, with the training dataset used to update the centroids. The KC3I model also includes a dictionary box to index all the centroids before and after the updating process; each centroid matches a corresponding label inside this box. After training, whenever input features arrive, the trained centroids assign them to the nearest cluster by Euclidean distance and then convert them into the class name associated with that centroid index. Two validation stages were carried out and met expectations in terms of precision, recall, F1-score, and absolute accuracy. The last part demonstrates the possibility of feature reduction by selecting the most crucial features with the extra trees classifier method; feeding the KC3I system only the most important features retains the same accuracy.
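A minimal sketch of the one-label-per-class idea follows: each class contributes a single labelled point that seeds a centroid, unlabelled points are assigned to the nearest centroid by Euclidean distance, and centroids are updated from their clusters. The iteration count and data layout are assumptions, and the paper's dictionary-box indexing is reduced here to a plain index-to-label mapping.

import numpy as np

def kc3i_fit(seed_points, seed_labels, unlabeled, n_iter=10):
    centroids = np.asarray(seed_points, dtype=float)        # one seed per class
    for _ in range(n_iter):
        d = np.linalg.norm(unlabeled[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)                            # nearest centroid
        for k in range(len(centroids)):
            members = unlabeled[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)          # centroid update
    return centroids, dict(enumerate(seed_labels))           # index -> class label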
Plant leaf detection through machine learning based image classification appr... (IAESIJAI)
Since maize is a staple food, especially for vegetarians and vegans, maize leaf disease has a significant influence on the food industry, including maize crop productivity. Maize quality must therefore be optimal, and to achieve this, maize must be safeguarded from several illnesses. As a result, there is great demand for an automated system that can identify the condition early on and prompt the appropriate action. Early disease identification is crucial, but it also poses a major obstacle. In this research project, we adopt the fundamental k-nearest neighbor (KNN) model and concentrate on building and developing an enhanced k-nearest neighbor (EKNN) model. EKNN aids in identifying several classes of disease. To gather discriminative, boundary, pattern, and structurally linked information, additional high-quality fine and coarse features are generated and then used in the classification process; the classification algorithm offers high-quality gradient-based features. The proposed model is assessed on the PlantVillage dataset, and a comparison with many standard classification models using various metrics is also performed.
Backbone search for object detection for applications in intrusion warning sy... (IAESIJAI)
In this work, we propose a novel backbone search method for object detection for applications in intrusion warning systems. The goal is to find a compact model for use in embedded thermal imaging cameras widely used in intrusion warning systems. The proposed method is based on faster region-based convolutional neural network (Faster R-CNN) because it can detect small objects. Inspired by EfficientNet, the sought-after backbone architecture is obtained by finding the most suitable width scale for the base backbone (ResNet50). The evaluation metrics are mean average precision (mAP), number of parameters, and number of multiply–accumulate operations (MACs). The experimental results showed that the proposed method is effective in building a lightweight neural network for the task of object detection. The obtained model can keep the predefined mAP while minimizing the number of parameters and computational resources. All experiments are executed elaborately on the person detection in intrusion warning systems (PDIWS) dataset.
Deep learning method for lung cancer identification and classification (IAESIJAI)
Lung cancer (LC) is claiming many lives and is becoming a serious cause of concern. Detecting LC at an early stage improves the chances of recovery, and the accuracy of early detection can be improved with a convolutional neural network (CNN) based deep learning approach. In this paper, we present two methodologies for lung cancer detection (LCD) applied to the Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI) datasets. Classification of these LC images is carried out using a support vector machine (SVM) and a deep CNN. The CNN is trained with i) multiple batches and ii) a single batch to classify LC images as non-cancer or cancer. All these methods are implemented in MATLAB. The classification accuracy obtained by SVM is 65%, whereas the deep CNN produced detection accuracies of 80% and 100% for multiple-batch and single-batch training, respectively. The novelty of our experimentation is the near-100% classification accuracy obtained by our deep CNN model when tested on 25 lung computed tomography (CT) test images, each of size 512×512 pixels, in fewer than 20 iterations, compared with the research carried out by other researchers using cropped LC nodule images.
Optically processed Kannada script realization with Siamese neural network model (IAESIJAI)
Optical character recognition (OCR) is a technology that allows computers to recognize and extract text from images or scanned documents; it is commonly used to convert printed or handwritten text into a machine-readable format. This study presents an OCR system for Kannada characters based on a Siamese neural network (SNN). The SNN, a deep neural network comprising two identical convolutional neural networks (CNNs), compares characters and ranks them by dissimilarity; when a lower dissimilarity score is found, the pair is predicted as a character match. In this work the authors use 5 classes of Kannada characters, initially preprocessed using grey scaling and converted to PGM format. These images are fed directly into the deep convolutional network, which learns from matching and non-matching image pairs using a contrastive loss function in the Siamese architecture. The proposed OCR system takes much less time and gives more accurate results than a regular CNN. The model can become a powerful tool for identification, particularly in situations with a high degree of variation in writing styles or limited training data.
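The contrastive loss used to train such a Siamese pair of CNNs can be sketched as below: matching character pairs are pulled together and non-matching pairs pushed apart up to a margin. The embedding networks, batch layout, and margin value are placeholders, not details from the paper.

import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_label, margin=1.0):
    # emb1, emb2: (batch, dim) embeddings from the twin CNNs
    # same_label: 1.0 if the pair shows the same character, else 0.0
    dist = F.pairwise_distance(emb1, emb2)
    pos = same_label * dist.pow(2)                          # pull matches together
    neg = (1 - same_label) * F.relu(margin - dist).pow(2)   # push non-matches apart
    return 0.5 * (pos + neg).mean()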
Embedded artificial intelligence system using deep learning and raspberrypi f... (IAESIJAI)
Melanoma is a kind of skin cancer that originates in the melanocytes responsible for producing melanin; it can be a severe and potentially deadly form of cancer because it can metastasize to other regions of the body if not detected and treated early. To facilitate this process, various computer-assisted, low-cost, reliable, and accurate diagnostic systems have recently been proposed based on artificial intelligence (AI) algorithms, particularly deep learning techniques. This work proposes an innovative and intelligent system that combines the internet of things (IoT) with a Raspberry Pi connected to a camera and a deep learning model based on a deep convolutional neural network (CNN) for real-time detection and classification of melanoma lesions. The key stages of the model before serializing it to the Raspberry Pi are: first, a preprocessing part covering data cleaning, data transformation (normalization), and data augmentation to reduce overfitting during training; then, a deep CNN for feature extraction; and finally, a classification part using a sigmoid activation function. The experimental results indicate the efficiency of the proposed classification system, which achieved an accuracy rate of 92%, a precision of 91%, a sensitivity of 91%, and an area under the receiver operating characteristic curve (AUC-ROC) of 0.9133.
Deep learning based biometric authentication using electrocardiogram and iris (IAESIJAI)
Authentication systems play an important role in a wide range of applications. Traditional token-, certificate-, and password-based authentication systems are now being replaced by biometric authentication systems, generally based on data obtained from the face, iris, electrocardiogram (ECG), fingerprint, and palm print. However, such unimodal authentication models suffer from accuracy and reliability issues. In this regard, multimodal biometric authentication systems have gained huge attention for building robust authentication, and current developments in deep learning have produced more robust architectures that overcome the issues of traditional machine-learning-based authentication systems. In this work, we adopt ECG and iris data and train the obtained features with a hybrid convolutional neural network-long short-term memory (CNN-LSTM) model. For ECG, R-peak detection is an important aspect of feature extraction and morphological features are extracted; for iris data, Gabor wavelet, gray level co-occurrence matrix (GLCM), gray level difference matrix (GLDM), and principal component analysis (PCA) based feature extraction methods are applied. The final feature vector, obtained from the MIT-BIH and IIT Delhi Iris datasets, is trained and tested with the CNN-LSTM. The experimental analysis shows that the proposed approach achieves an average accuracy, precision, and F1-score of 0.985, 0.962, and 0.975, respectively.
Hybrid channel and spatial attention-UNet for skin lesion segmentation (IAESIJAI)
Melanoma is a type of skin cancer that has affected many lives globally. American Cancer Society research suggests that it is a serious type of skin cancer and can lead to mortality, but it is almost 100% curable if detected and treated in its early stages. Automated computer-vision-based schemes are currently widely adopted, but these systems suffer from poor segmentation accuracy. To overcome this issue, deep learning (DL) has become the promising solution, performing extensive training for pattern learning and providing better classification accuracy. However, skin lesion segmentation is affected by skin hair, unclear boundaries, pigmentation, and moles. To address this, we adopt a UNet-based deep learning scheme and incorporate an attention mechanism that considers low-level and high-level statistics combined with feedback and skip-connection modules; this helps obtain robust features without neglecting channel information. Further, we use channel attention and spatial attention modulation to achieve the final segmentation. The proposed DL-based scheme is evaluated on a publicly available dataset, and the experimental investigation shows that the proposed hybrid attention UNet approach achieves average performance of 0.9715, 0.9962, and 0.9710.
Photoplethysmogram signal reconstruction through integrated compression sensi... (IAESIJAI)
The transmission of photoplethysmogram (PPG) signals in real time is extremely challenging and motivates the use of an internet of things (IoT) environment for healthcare monitoring. This paper proposes an approach for PPG signal reconstruction through integrated compression sensing and basis function aware shallow learning (CSBSL). The integrated CSBSL approach compresses PPG signals jointly across multiple channels, thereby improving reconstruction accuracy for the PPG signals essential in healthcare monitoring. An optimal basis function aware shallow learning procedure is employed on PPG signals with prior initialization; this is further fine-tuned by utilizing the knowledge of the other channels, which exploits the additional sparsity of the PPG signals. The proposed learning method, combined with the PPG signals, retains knowledge of spatial and temporal correlation. The integrated CSBSL approach consists of two steps; in the first step, basis-function-aware shallow learning is carried out by training on the PPG signals. The proposed method is evaluated using multichannel PPG signal reconstruction, which potentially benefits clinical applications through PPG monitoring and diagnosis.
Speaker identification under noisy conditions using hybrid convolutional neur... (IAESIJAI)
Speaker identification is a biometric task that classifies or identifies a person among other speakers based on speech characteristics. Recently, deep learning models have outperformed conventional machine learning models in speaker identification. Spectrograms of speech have been used as input in deep-learning-based speaker identification with clean speech, but the performance of speaker identification systems degrades under noisy conditions. Cochleograms have shown better results than spectrograms in deep-learning-based speaker recognition under noisy and mismatched conditions. Moreover, hybrid convolutional neural network (CNN) and recurrent neural network (RNN) variants have shown better performance than CNN or RNN variants alone in recent studies. However, no attempt has been made to use a hybrid CNN and enhanced RNN variant for speaker identification with cochleogram input to improve performance under noisy and mismatched conditions. In this study, speaker identification using a hybrid CNN and gated recurrent unit (GRU) is proposed for noisy conditions using cochleogram input. The VoxCeleb1 audio dataset, with real-world noises, white Gaussian noise (WGN), and without additive noise, was employed for the experiments. The experimental results and comparison with existing work show that the proposed model performs better than the other models in this study and in existing work.
Multi-channel microseismic signals classification with convolutional neural n... (IAESIJAI)
Identifying and classifying microseismic signals is essential to warn of dangers in mines. Deep learning has replaced traditional methods, but labor-intensive manual identification and varying deep learning outcomes pose challenges. This paper proposes a transfer-learning-based convolutional neural network (CNN) method called microseismic signals-convolutional neural network (MS-CNN) to automatically recognize and classify microseismic events and blasts. The model was trained on a limited sample of data to obtain an optimal weight model for microseismic waveform recognition and classification. A comparative analysis was performed with an existing CNN model and classical image classification models such as AlexNet, GoogLeNet, and ResNet50. The outcomes demonstrate that the MS-CNN model achieved the best recognition and classification performance (99.6% accuracy) in the shortest time (0.31 s to identify the 277 images in the test set). Thus, the MS-CNN model can efficiently recognize and classify microseismic events and blasts in practical engineering applications, improving the timeliness of microseismic signal recognition and further enhancing the accuracy of event classification.
Sophisticated face mask dataset: a novel dataset for effective coronavirus di... (IAESIJAI)
Efficient and accurate coronavirus disease (COVID-19) surveillance necessitates robust identification of individuals wearing face masks. This research introduces the sophisticated face mask dataset (SFMD), a comprehensive compilation of high-quality face mask images enriched with detailed annotations on mask types, fits, and usage patterns. Leveraging cutting-edge deep learning models (EfficientNet-B2, ResNet50, and MobileNet-V2), we compare SFMD against two established benchmarks: the real-world masked face dataset (RMFD) and the masked face recognition dataset (MFRD). Across all models, SFMD consistently outperforms RMFD and MFRD in key metrics, including accuracy, precision, recall, and F1 score. Additionally, our study demonstrates the dataset's capability to cultivate robust models resilient to intricate scenarios like low-light conditions and facial occlusions due to accessories or facial hair.
Transfer learning for epilepsy detection using spectrogram images (IAESIJAI)
Epilepsy stands out as one of the common neurological diseases. The neural activity of the brain is observed using electroencephalography (EEG). Manual inspection of EEG brain signals is a slow and arduous process, which puts a heavy load on neurologists and affects their performance. The aim of this study is to find the best classification result using transfer learning models that automatically distinguish epileptic from normal activity, classifying EEG signals through spectrogram images that represent the percentage of energy for each coefficient of the continuous wavelet transform. The dataset includes EEG signals recorded at an epilepsy monitoring unit. The study presents an application of transfer learning by comparing three models, AlexNet, visual geometry group (VGG19), and residual neural network (ResNet), in different combinations with seven different classifiers. The models were tested and reached different values of accuracy and other metrics used to judge their performance; the best combination was ResNet with a support vector machine (SVM) classifier, which classified EEG signals with a high success rate, achieving 97.22% accuracy and an error rate of 2.78%.
Deep neural network for lateral control of self-driving cars in urban environ... (IAESIJAI)
The exponential growth of the automotive industry clearly indicates that self-driving cars are the future of transportation. However, their biggest challenge lies in lateral control, particularly in urban bottlenecking environments, where disturbances and obstacles are abundant. In these situations, the ego vehicle has to follow its own trajectory while rapidly correcting deviation errors without colliding with other nearby vehicles. Various research efforts have focused on developing lateral control approaches, but these methods remain limited in terms of response speed and control accuracy. This paper presents a control strategy using a deep neural network (DNN) controller to effectively keep the car on the centerline of its trajectory and adapt to disturbances arising from deviations or trajectory curvature. The controller focuses on minimizing deviation errors. The Matlab/Simulink software is used for designing and training the DNN. Finally, simulation results confirm that the suggested controller has several advantages in terms of precision, with lateral deviation remaining below 0.65 meters, and rapidity, with a response time of 0.7 seconds, compared to traditional controllers in solving lateral control.
Attention mechanism-based model for cardiomegaly recognition in chest X-Ray i... (IAESIJAI)
Recently, cardiovascular diseases (CVDs) have become a rapidly growing problem in the world, especially in developing countries. The latter are facing a lifestyle change that introduces new risk factors for heart disease, which requires particular and urgent attention. Cardiomegaly is a sign of cardiovascular disease that refers to various conditions; it is associated with heart enlargement that can be either transient or permanent depending on certain conditions. Furthermore, cardiomegaly is visible on any imaging test, including chest X-ray images, which are one of the most common tools used by cardiologists to detect and diagnose many diseases. In this paper, we propose an innovative deep learning (DL) model based on an attention module and the MobileNet architecture to recognize cardiomegaly patients using the popular ChestX-ray8 dataset. The attention module captures the spatial relationship between the relevant regions in chest X-ray images. The experimental results show that the proposed model achieved interesting results with an accuracy rate of 81%, which makes it suitable for detecting cardiomegaly.
Efficient commodity price forecasting using long short-term memory modelIAESIJAI
Predicting commodity prices, particularly food prices, is a significant concern for various stakeholders, especially in regions that are highly sensitive to commodity price volatility. Historically, many machine learning models like autoregressive integrated moving average (ARIMA) and support vector machine (SVM) have been suggested to overcome the forecasting task. These models struggle to capture the multifaceted and dynamic factors influencing these prices. Recently, deep learning approaches have demonstrated considerable promise in handling complex forecasting tasks. This paper presents a novel long short-term memory (LSTM) network-based model for commodity price forecasting. The model uses five essential commodities namely bread, meat, milk, oil, and petrol. The proposed model focuses on advanced feature engineering which involves moving averages, price volatility, and past prices. The results reveal that our model outperforms traditional methods as it achieves 0.14, 3.04%, and 98.2% for root mean square error (RMSE), mean absolute percentage error (MAPE), and R-squared (R2 ), respectively. In addition to the simplicity of the model, which consists of an LSTM single-cell architecture that reduced the training time to a few minutes instead of hours. This paper contributes to the economic literature on price prediction using advanced deep learning techniques as well as provides practical implications for managing commodity price instability globally.
1-dimensional convolutional neural networks for predicting sudden cardiacIAESIJAI
Sudden cardiac arrest (SCA) is a serious heart problem that occurs without symptoms or warning. SCA causes high mortality. Therefore, it is important to estimate the incidence of SCA. Current methods for predicting ventricular fibrillation (VF) episodes require monitoring patients over time, resulting in no complications. New technologies, especially machine learning, are gaining popularity due to the benefits they provide. However, most existing systems rely on manual processes, which can lead to inefficiencies in disseminating patient information. On the other hand, existing deep learning methods rely on large data sets that are not publicly available. In this study, we propose a deep learning method based on one-dimensional convolutional neural networks to learn to use discrete fourier transform (DFT) features in raw electrocardiogram (ECG) signals. The results showed that our method was able to accurately predict the onset of SCA with an accuracy of 96% approximately 90 minutes before it occurred. Predictions can save many lives. That is, optimized deep learning models can outperform manual models in analyzing long-term signals.
A deep learning-based approach for early detection of disease in sugarcane pl...IAESIJAI
In many regions of the nation, agriculture serves as the primary industry. The farming environment now faces a number of challenges to farmers. One of the major concerns, and the focus of this research, is disease prediction. A methodology is suggested to automate a process for identifying disease in plant growth and warning farmers in advance so they can take appropriate action. Disease in crop plants has an impact on agricultural production. In this work, a novel DenseNet-support vector machine: explainable artificial intelligence (DNet-SVM: XAI) interpretation that combines a DenseNet with support vector machine (SVM) and local interpretable model-agnostic explanation (LIME) interpretation has been proposed. DNet-SVM: XAI was created by a series of modifications to DenseNet201, including the addition of a support vector machine (SVM) classifier. Prior to using SVM to identify if an image is healthy or un-healthy, images are first feature extracted using a convolution network called DenseNet. In addition to offering a likely explanation for the prediction, the reasoning is carried out utilizing the visual cue produced by the LIME. In light of this, the proposed approach, when paired with its determined interpretability and precision, may successfully assist farmers in the detection of infected plants and recommendation of pesticide for the identified disease.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Analysis of the evolution of advanced transformer-based language models: Experiments on opinion mining
IAES International Journal of Artificial Intelligence (IJ-AI)
Vol. 12, No. 4, December 2023, pp. 1995–2010
ISSN: 2252-8938, DOI: 10.11591/ijai.v12.i4.pp1995-2010
Analysis of the evolution of advanced transformer-based language models: experiments on opinion mining
Nour Eddine Zekaoui, Siham Yousfi, Maryem Rhanoui, Mounia Mikram
Meridian Team, LYRICA Laboratory, School of Information Sciences, Rabat, Morocco
Article Info
Article history:
Received Jan 5, 2023
Revised Jan 16, 2023
Accepted Mar 10, 2023
Keywords:
Natural language processing
Opinion mining
Transformer-based models
ABSTRACT
Opinion mining, also known as sentiment analysis, is a subfield of natural language processing (NLP) that focuses on identifying and extracting subjective information from textual material. This can include determining the overall sentiment of a piece of text (e.g., positive or negative), as well as identifying the specific emotions or opinions expressed in it, which involves the use of advanced machine and deep learning techniques. Recently, transformer-based language models have made this task of human emotion analysis more intuitive, thanks to the attention mechanism and parallel computation. These advantages make such models very powerful on linguistic tasks, unlike recurrent neural networks, which spend a lot of time on sequential processing and are therefore prone to fail when processing long text. Our paper aims to study the behaviour of the cutting-edge transformer-based language models on opinion mining and provide a high-level comparison between them to highlight their key particularities. Additionally, our comparative study points production engineers toward the approaches to focus on and provides researchers with guidelines for future research subjects.
This is an open access article under the CC BY-SA license.
Corresponding Author:
Nour Eddine Zekaoui
Meridian Team, LYRICA Laboratory, School of Information Sciences
Rabat, Morocco
Email: noureddinezekaoui@gmail.com, nour-eddine.zekaoui@esi.ac.ma
1. INTRODUCTION
Over the past few years, interest in natural language processing (NLP) [1] has increased significantly. Today, several applications are investing massively in this new technology, such as extending recommender systems [2], [3], uncovering new insights in the health industry [4], [5], and unraveling e-reputation and opinion mining [6], [7]. Opinion mining is an approach to computational linguistics and NLP that automatically identifies the emotional tone, sentiment, or thoughts behind a body of text. As a result, it plays a vital role in driving business decisions in many industries. However, seeking customer satisfaction is costly. Indeed, mining user feedback regarding the products offered is the most accurate way to adapt strategies and future business plans. In recent years, opinion mining has seen considerable progress, with applications in social media and review websites. Recommendation may be staff-oriented [2] or user-oriented [8] and should be tailored to meet customer needs and behaviors.
Nowadays, analyzing people's emotions has become more intuitive thanks to the availability of many large pre-trained language models such as bidirectional encoder representations from transformers (BERT) [9] and its variants. These models use the seminal transformer architecture [10], which is based solely on attention mechanisms, to build robust language models for a variety of semantic tasks, including text classification.
Moreover, there has been a surge in opinion mining text datasets, specifically designed to challenge NLP models and enhance their performance. These datasets are aimed at enabling models to imitate or even exceed human-level performance, while introducing more complex features.
Even though many papers have addressed NLP topics for opinion mining using high-performance deep learning models, it is still challenging to determine their performance concretely and accurately due to variations in technical environments and datasets. Therefore, to address these issues, our paper aims to study the behaviour of the cutting-edge transformer-based models on textual material and reveal their differences. In particular, it focuses on applying both transformer encoders and decoders, such as BERT [9] and the generative pre-trained transformer (GPT) [11], respectively, and their improvements on a benchmark dataset. This enables a credible assessment of their performance and an understanding of their advantages, allowing subject matter experts to clearly rank the models. Furthermore, through ablations, we show the impact of configuration choices on the final results.
2. BACKGROUND
2.1. Transformer
The transformer [10], as illustrated in Figure 1, is an encoder-decoder model that dispenses entirely with recurrence and convolutions. Instead, it leverages the attention mechanism to compute high-level contextualized embeddings. Being the first model to rely solely on attention mechanisms, it is able to address the issues commonly associated with recurrent neural networks, which factor computation along the symbol positions of the input and output sequences and thereby preclude parallelization within samples. In contrast, the transformer is highly parallelizable and requires significantly less time to train. In the upcoming sections, we highlight the recent breakthroughs in NLP involving the transformer, which changed the field overnight through designs built on it, such as BERT [9] and its improvements.
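To make the mechanism concrete, the following is a minimal sketch (not the authors' code) of the scaled dot-product attention described above, written in PyTorch; the tensor shapes and the toy usage at the end are purely illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k) tensors."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # attention distribution
    return weights @ v                               # contextualized values

# toy usage: one sample, two heads, five tokens, 64-dimensional heads
q = k = v = torch.randn(1, 2, 5, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 5, 64])
```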
Figure 1. The transformer model architecture [10]
2.2. BERT
BERT [9] is pre-trained using a combination of masked language modeling (MLM) and next sentence prediction (NSP) objectives. It provides high-level contextualized embeddings that grasp the meaning of words in different contexts through global attention. As a result, the pre-trained BERT model can be fine-tuned for a wide range of downstream tasks, such as question answering and text classification, without substantial task-specific architecture modifications.
BERT and its variants allow the training of modern data-intensive models. Moreover, they are able to capture the contextual meaning of each piece of text in a way that traditional language models are unfit to do, while being quicker to develop and yielding better results with less data. On the other hand, BERT and other large neural language models are very expensive and computationally intensive to train, fine-tune, and run inference with.
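As a hedged illustration of the MLM objective, the snippet below queries a pre-trained BERT checkpoint through the Hugging Face fill-mask pipeline; the bert-base-uncased checkpoint and the example sentence are assumptions made only for demonstration.

```python
# Querying a pre-trained BERT masked-language model to see the kind of
# contextual predictions that MLM pre-training yields.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The movie was absolutely [MASK]."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```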
2.3. GPT-I, II, III
GPT [11] is the first causal, or autoregressive, transformer-based model pre-trained using language modeling on a large corpus with long-range dependencies. A bigger and optimized version, called GPT-2 [12], was pre-trained on WebText. Likewise, GPT-3 [13] is architecturally similar to its predecessors; its higher level of accuracy is attributed to its increased capacity and greater number of parameters, and it was pre-trained on Common Crawl. The OpenAI GPT family of models has taken pre-trained language models by storm; they are very powerful at realistic human text generation and many other miscellaneous NLP tasks. Therefore, a small amount of input text can be used to generate a large amount of high-quality text, while maintaining semantic and syntactic understanding of each word.
2.4. ALBERT
A lite BERT (ALBERT) [14] was proposed to address the problems associated with large models. It was specifically designed to provide contextualized natural language representations that improve results on downstream tasks. However, increasing the model size to pre-train such embeddings becomes harder due to memory limitations and longer training times, which is the motivation behind this model.
ALBERT is a lighter version of BERT, in which next sentence prediction (NSP) is replaced by sentence order prediction (SOP). In addition, it employs two parameter-reduction techniques to reduce memory consumption and improve the training time of BERT without hurting performance:
− Factorizing the embedding matrix into two smaller matrices, so that the hidden size can grow with fewer parameters: ALBERT separates the hidden layer size from the vocabulary embedding size by decomposing the vocabulary embedding matrix (a back-of-the-envelope illustration follows the list).
− Repeating layers split among groups via cross-layer parameter sharing, to prevent the parameter count from growing with the depth of the network.
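The sketch below illustrates the factorized embedding parameterization; the vocabulary size V, hidden size H, and embedding size E are assumed values chosen only to show the order of magnitude of the savings, not figures reported in the ALBERT paper.

```python
# Back-of-the-envelope comparison of a full V x H embedding matrix versus
# ALBERT-style factorization into V x E followed by E x H.
V, H, E = 30_000, 768, 128           # assumed vocabulary, hidden, embedding sizes

bert_style = V * H                    # one V x H embedding matrix
albert_style = V * E + E * H          # V x E projection followed by E x H

print(f"V x H        : {bert_style:,} parameters")     # 23,040,000
print(f"V x E + E x H: {albert_style:,} parameters")   # 3,938,304
```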
2.5. RoBERTa
The choice of language model hyper-parameters has a substantial impact on the final results. Hence, the robustly optimized BERT pre-training approach (RoBERTa) [15] was introduced to investigate the impact of many key hyper-parameters, along with data size, on model performance. RoBERTa is based on Google's BERT [9] model and modifies key hyper-parameters: the masked language modeling objective is made dynamic and the NSP objective is removed. It is an improved version of BERT, pre-trained with much larger mini-batches and learning rates on a large corpus using self-supervised learning.
2.6. XLNet
The bidirectional property of transformer encoders, such as BERT [9], helps them achieve better performance than autoregressive language modeling based approaches. Nevertheless, BERT ignores the dependency between masked positions and suffers from a pretrain-finetune discrepancy because it relies on corrupting the input with masks. In view of these pros and cons, XLNet [16] has been proposed. XLNet is a generalized autoregressive pretraining approach that allows learning bidirectional dependencies by maximizing the expected likelihood over all permutations of the factorization order. Furthermore, it overcomes the drawbacks of BERT [9] thanks to its causal, or autoregressive, formulation, inspired by the transformer-XL [17].
2.7. DistilBERT
Unfortunately, the outstanding performance that comes with large-scale pre-trained models is not cheap. In fact, operating them on edge devices under constrained computational training or inference budgets remains challenging. Against this backdrop, DistilBERT [18] (or distilled BERT) was introduced to address the cited issues by leveraging knowledge distillation [19].
DistilBERT is similar to BERT, but it is smaller, faster, and cheaper. It has 40% fewer parameters than BERT base and runs 40% faster, while preserving over 95% of BERT's performance. It is trained by distilling the pre-trained BERT base model.
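The following is a generic sketch of the soft-target distillation loss from Hinton et al. [19], not DistilBERT's exact training recipe (which also combines MLM and cosine-embedding terms); the logits and temperature are illustrative.

```python
# The student is trained to match the teacher's temperature-softened distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student predictions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # batchmean KL, scaled by t^2 as in Hinton et al. [19]
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

student = torch.randn(4, 30_000)   # toy logits over a vocabulary
teacher = torch.randn(4, 30_000)
print(distillation_loss(student, teacher).item())
```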
2.8. XLM-RoBERTa
Pre-trained multilingual models at scale, such as multilingual BERT (mBERT) [9] and cross-lingual language models (XLMs) [20], have led to considerable performance improvements for a wide variety of cross-lingual transfer tasks, including question answering, sequence labeling, and classification. The multilingual version of RoBERTa [15], called XLM-RoBERTa [21] and pre-trained on the newly created 2.5TB multilingual CommonCrawl corpus covering 100 different languages, has pushed performance even further. It has shown strong improvements on low-resource languages compared to previous multilingual models.
2.9. BART
Bidirectional and auto-regressive transformer (BART) [22] is a generalization of BERT [9] and GPT [11] that takes advantage of the standard transformer [10]. Concretely, it uses a bidirectional encoder and a left-to-right decoder. It is trained by corrupting text with an arbitrary noising function and learning a model to reconstruct the original text. BART has shown phenomenal success when fine-tuned on text generation tasks such as translation, but it also performs well on comprehension tasks like question answering and classification.
2.10. ConvBERT
While BERT [9] and its variants have recently achieved incredible performance gains in many NLP tasks compared to previous models, BERT suffers from a large computation cost and memory footprint due to its reliance on the global self-attention block. Despite all its attention heads, BERT was found to be computationally redundant, since some heads simply need to learn local dependencies. Therefore, ConvBERT [23] was proposed as an improved version of BERT [9], in which self-attention blocks are replaced with new mixed ones that leverage convolutions to better model global and local context.
2.11. Reformer
Consistently, large transformer [10] models achieve state-of-the-art results in a large variety of linguistic tasks, but training them on long sequences is costly and challenging. To address this issue, the Reformer [24] was introduced to improve the efficiency of transformers while retaining high performance and smooth training. Reformer is more efficient than the transformer [10] thanks to locality-sensitive hashing attention, reversible residual layers instead of the standard residuals, axial position encoding, and other optimizations.
2.12. T5
Transfer learning has emerged as one of the most influential techniques in NLP. Its efficiency in transferring knowledge to downstream tasks through fine-tuning has given birth to a range of innovative approaches. One of these approaches is transfer learning with a unified text-to-text transformer (T5) [25], which consists of a bidirectional encoder and a left-to-right decoder. This approach is reshaping the transfer learning landscape by leveraging the power of being pre-trained on a combination of unsupervised and supervised tasks and by reframing every NLP task into a text-to-text format.
2.13. ELECTRA
Masked language modeling (MLM) approaches like BERT [9] have proven to be effective when transferred to downstream NLP tasks, although they are expensive and require large amounts of compute. Efficiently learning an encoder that classifies token replacements accurately (ELECTRA) [26] is a new pre-training approach that aims to overcome these computation problems by training two transformer models: a generator and a discriminator. ELECTRA trains on a replaced token detection objective, using the discriminator to identify which tokens were replaced by the generator in the sequences. Unlike MLM-based models, whose objective covers only the small subset of masked tokens, ELECTRA's objective is defined over all input tokens, making it a more efficient pre-training approach.
2.14. Longformer
While previous transformers focused on changing the pre-training methods, the long-document transformer (Longformer) [27] changes the transformer's self-attention mechanism. It has become a de facto standard for tackling a wide range of complex NLP tasks, with a new attention mechanism that scales linearly with sequence length and is thus able to easily process longer sequences. Longformer's new attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task-motivated global attention. Simply put, it replaces the transformer [10] attention matrices with sparse matrices for higher training efficiency.
2.15. DeBERTa
DeBERTa [28] stands for decoding-enhanced BERT with disentangled attention. It is a pre-training approach that extends Google's BERT [9] and builds on RoBERTa [15]. Despite being trained on only half of the data used for RoBERTa, DeBERTa improves the efficiency of pre-trained models through the use of two novel techniques:
− Disentangled attention (DA): an attention mechanism that computes the attention weights among words using disentangled matrices based on two vectors that encode the content and the relative position of each word, respectively.
− Enhanced mask decoder (EMD): a pre-training technique that replaces the output softmax layer and incorporates absolute positions in the decoding layer to predict masked tokens during model pre-training.
3. APPROACH
Transformer-based pre-trained language models have led to substantial performance gains, but carefully comparing different approaches is challenging. Therefore, we extend our study to uncover insights regarding their fine-tuning process and main characteristics. Our paper first aims to study the behavior of these models, following two approaches: a data-centric view focusing on the data state and quality, and a model-centric view giving more attention to the model tweaks. Indeed, we will see how data processing affects their performance and how adjustments and improvements made to a model over time change its performance. Thus, we seek to end with some takeaways regarding the optimal setup that aids in cross-validating a transformer-based model, specifically model hyper-parameter tuning and data quality.
3.1. Models summary
In this section, we present the details of the base versions of the models introduced previously, as shown in Table A1. We aim to provide a fair comparison based on the following criteria: L, the number of transformer layers; H, the hidden state size or model dimension; A, the number of attention heads; the number of total parameters; the tokenization algorithm; the data used for pre-training; the training devices and computational cost; the training objectives; the tasks on which the model performs well; and a short description of the model's key points [29]. All this information helps in understanding the performance and behaviour of the different transformer-based models and aids in making the appropriate choice depending on the task and resources.
3.2. Configuration
It should be noted that we have used almost the same architecture building blocks for all our implemented models, as shown in Figure 2 and Figure 3 for the encoder- and decoder-based models, respectively. In contrast, seq2seq models like BART are simply a bidirectional encoder followed by an autoregressive decoder. Each model is fed with the three required inputs, namely input ids, token type ids, and attention mask. However, for some models, the position embeddings are optional and can sometimes be completely ignored (e.g., RoBERTa); for this reason, we have blurred them a bit in the figures. Furthermore, it is important to note that we lower-cased the dataset and tokenized it with tokenizers based on the WordPiece [30], SentencePiece [31], and byte-pair-encoding [32] algorithms.
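As a minimal sketch of how these three inputs can be produced with the transformers library (assuming the bert-base-uncased WordPiece tokenizer and a toy review):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece [30]
enc = tokenizer(
    "black panther is not boring".lower(),
    max_length=384, padding="max_length", truncation=True, return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 384])
print(enc.keys())              # input_ids, token_type_ids, attention_mask
```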
In our experiments, we used a highly optimized setup using only the base version of each pre-trained language model. For training and validation, we set batch sizes of 8 and 4, respectively, and fine-tuned the models for 4 epochs over the data with a maximum sequence length of 384, chosen to cover the majority of review lengths within our computational capabilities. The AdamW optimizer is used to optimize the models with a learning rate of 3e-5, and the epsilon (eps) used to improve numerical stability is set to 1e-6, which is the default value. Furthermore, the weight decay is set to 0.001 for all parameters except bias, LayerNorm.bias, and LayerNorm.weight, which are excluded from weight decay (their decay is set to 0.0) during fine-tuning. We implemented all of our models using PyTorch and the transformers library from Hugging Face, and ran them on an NVIDIA Tesla P100-PCIE GPU (Persistence-M, 51G GPU RAM).
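A hedged sketch of this optimizer setup is given below; the bert-base-uncased checkpoint stands in for any of the fine-tuned base models, and the training loop itself is omitted.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# weight decay of 0.001, excluding bias and LayerNorm parameters (decay 0.0)
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
grouped_params = [
    {"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.001},
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(grouped_params, lr=3e-5, eps=1e-6)
# training loop: 4 epochs, batch size 8 (train) / 4 (validation), max sequence length 384
```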
Figure 2. The architecture of the transformer encoder-based models
3.3. Evaluation
Dataset: To fine-tune our models, we used the IMDb movie review dataset [33], a binary sentiment classification dataset of 50K highly polar movie reviews labelled in a balanced way between positive and negative. We chose it for our study because it is a very popular resource for researchers working on NLP and ML tasks, particularly those related to sentiment analysis and text classification, due to its accessibility, size, balance, and pre-processing. In other words, it is easily accessible and widely available, with over 50K reviews split evenly between positive and negative, as shown in Figure 4, which helps prevent biases in the trained model. Additionally, it has already been pre-processed, with the text of each review cleaned and normalized.
Metrics: To assess the performance of the fine-tuned transformers on the IMDb movie reviews dataset, tracking the loss and accuracy learning curves for each model is an effective method. These curves can help detect incorrect predictions and potential overfitting, which are crucial factors to consider in the evaluation process. Moreover, the widely-used metrics accuracy, recall, precision, and F1-score are valuable when dealing with classification problems. These metrics are defined as:
Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 × (Precision × Recall) / (Precision + Recall)   (1)
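For reference, a short sketch of how these metrics can be computed with scikit-learn; the label and prediction arrays are toy placeholders, not results from the paper.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1]   # toy ground truth (1 = positive review)
y_pred = [1, 0, 0, 1, 0, 1]   # toy model predictions
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"accuracy={accuracy_score(y_true, y_pred):.3f} "
      f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```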
Figure 3. The architecture of the transformer decoder-based models
Figure 4. Positive and negative reviews distribution
4. RESULTS
In this section, we present the main fine-tuning results of our implemented transformer-based language models on the opinion mining task on the IMDb movie reviews dataset. Overall, all the fine-tuned models perform well with fairly high scores, except the three autoregressive models GPT, GPT-2, and Reformer, as shown in Table 1. The best model, ELECTRA, provides an F1-score of 95.6 points, followed by RoBERTa, Longformer, and DeBERTa, with F1-scores of 95.3, 95.1, and 95.1 points, respectively. On the other hand, the worst model, GPT-2, provides an F1-score of 52.9 points, as shown in Figure 5 and Figure 6. From the results, it is clear that purely autoregressive models do not perform well on comprehension tasks like sentiment classification, where sequences may require access to bidirectional contexts for better word representation and therefore good classification accuracy. In contrast, autoencoding models, which take advantage of both left and right contexts, showed good performance gains. For instance, the autoregressive XLNet model is our fourth best model in Table 1 with an F1-score of 94.9%; it incorporates modelling techniques from autoencoding models into autoregressive models while avoiding and addressing the limitations of encoders. The code and fine-tuned models are available at [34].
Table 1. Transformer-based language models validation performance on the opinion mining IMDb dataset
Model Recall Precision F1 Accuracy
BERT 93.9 94.3 94.1 94.0
GPT 92.4 51.8 66.4 53.2
GPT-2 51.1 54.8 52.9 54.5
ALBERT 94.1 91.9 93.0 93.0
RoBERTa 96.0 94.6 95.3 95.3
XLNet 94.7 95.1 94.9 94.8
DistilBERT 94.3 92.7 93.5 93.4
XLM-RoBERTA 83.1 71.7 77.0 75.2
BART 96.0 93.3 94.6 94.6
ConvBERT 95.5 93.7 94.6 94.5
DeBERTa 95.2 95.0 95.1 95.1
ELECTRA 95.8 95.4 95.6 95.6
Longformer 95.9 94.3 95.1 95.0
Reformer 54.6 52.1 53.3 52.2
T5 94.8 93.4 94.0 93.9
Figure 5. Worst model: GPT-2 loss learning curve
5. ABLATION STUDY
In Table 2 and Figure 7, we demonstrate the importance of configuration choices through controlled trials and ablation experiments. Indeed, the maximum sequence length and data cleaning are particularly crucial. Thus, to make our ablation study credible, we fine-tuned our BERT model with the same setup, changing only the sequence length (max-len) in one experiment and cleaning the data (cd) in another, to observe how each affects the performance of the model.
Figure 6. Worst model: GPT-2 acc learning curve
Table 2. Validation results of the BERT model based on different configurations, where cd stands for cleaned data, meaning that the last model (BERTmax-len=384, cd) is trained on exhaustively cleaned text
Model Recall Precision F1 Accuracy
BERTmax-len=64 86.8% 84.7% 85.8% 85.6%
BERTmax-len=384 93.9% 94.3% 94.1% 94.0%
BERTmax-len=384, cd 92.6% 91.6% 92.1% 92.2%
Figure 7. Validation accuracy history of BERT model based on different configurations
5.1. Effects of hyper-parameters
The gap between the performance of BERTmax-len=64 and BERTmax-len=384 on the IMDb dataset is an astounding 8.3 F1 points, as shown in Table 2, demonstrating how important this parameter is. Therefore, visualizing the distribution of token or word counts is the best way to define an optimal value of the maximum length parameter that covers most of the training data points. Figure 8 illustrates the distribution of the number of tokens in the IMDb movie reviews dataset; it shows that the majority of reviews are between 100 and 400 tokens long. In this context, we chose 384 as the maximum length reference to study the effect of this parameter, because it covers the majority of review lengths while conserving memory and saving computational resources. It should be noted that the BERT model can process texts of up to 512 tokens; this limit is a consequence of the model architecture and cannot be adjusted directly.
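A sketch of how such a token-count distribution can be computed to pick the maximum length is shown below; the two-element reviews list is a placeholder for the 50K IMDb reviews, and the WordPiece tokenizer is an assumption.

```python
from transformers import AutoTokenizer

# placeholder for the full list of IMDb review texts
reviews = [
    "A touching, brilliantly acted drama.",
    "Two hours of my life I will never get back.",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lengths = [len(tokenizer.encode(text)) for text in reviews]

# share of reviews fully covered by a candidate maximum length
covered = sum(l <= 384 for l in lengths) / len(lengths)
print(f"share of reviews fully covered by max_len=384: {covered:.1%}")
```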
5.2. Effects of data cleaning
Traditional machine learning algorithms require extensive data cleaning before vectorizing the input sequence and feeding it to the model, with the aim of improving both the reliability and the quality of the data, so that the model can focus only on important features during training. In contrast, performance dropped dramatically by 2 F1 points when we cleaned the data for the BERT model. The cleaning carried out aims to normalize the words of each review. It includes lemmatization to group together the different forms of the same word, stemming to reduce a word to its root by stripping suffixes and prefixes, deletion of URLs, punctuation, and patterns that do not contribute to the sentiment, as well as the elimination of all stop words except "no", "nor", and "not", because their contribution to the sentiment can be tricky. For instance, "Black Panther is boring" is a negative review, but "Black Panther is not boring" is a positive review. This drop can be justified by the fact that BERT and other attention-based models need all the words of a sequence to better capture the meaning of words in context. With cleaning, however, words may be represented differently from their meaning in the original sequence. Note that "not boring" and "boring" are completely opposite in meaning, but if the stop word "not" is removed, we end up with two similar sequences, which is not good in a sentiment analysis context.
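An illustrative, NLTK-based version of such a cleaning pipeline is sketched below; it is an assumption of what the described steps could look like, not the authors' code.

```python
# requires: nltk.download("stopwords"); nltk.download("wordnet")
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
kept_negations = {"no", "nor", "not"}
stop_words = set(stopwords.words("english")) - kept_negations

def clean(review: str) -> str:
    review = re.sub(r"https?://\S+", " ", review.lower())  # drop URLs
    review = re.sub(r"[^a-z\s]", " ", review)               # drop punctuation/digits
    tokens = [stemmer.stem(lemmatizer.lemmatize(w))
              for w in review.split() if w not in stop_words]
    return " ".join(tokens)

print(clean("Black Panther is not boring!"))  # negation is preserved
```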
Figure 8. Distribution of the number of tokens for a better selection of the maximum sequence length
5.3. Effects of bias and training data
Carefully observing the accuracy and loss learning curves in Figure 9 and Figure 10, we notice that the validation loss starts to creep upward while the validation accuracy starts to go down. From this perspective, the model in question progressively loses its ability to generalize well on unseen data. In fact, the model is relatively biased due to the effect of the training data and data-drift issues related to the fine-tuning data. In this context, we assume that the model starts to overfit. However, setting different dropouts, reducing the learning rate, or even trying larger batches will not work; these strategies sometimes give worse results and an even more severe overfitting problem. For this reason, pre-training these models on your industry data and vocabulary and then fine-tuning them may be the best solution.
Figure 9. Best model: ELECTRA loss learning curve
Figure 10. Best model: ELECTRA acc. learning curve
6. CONCLUSION
In this paper, we presented a detailed comparison to highlight the main characteristics of transformer-based pre-trained language models and what differentiates them from each other. Then, we studied their performance on the opinion mining task. Thereby, we showed the power of fine-tuning and how it helps in leveraging the pre-trained models' knowledge to achieve high accuracy on downstream tasks, even with the bias they carry from the pre-training data. Experimental results show how performant these models are: we obtained the highest F1-score, 95.6 points, with the ELECTRA model on the IMDb dataset. Similarly, we found that access to both left and right contexts is necessary when it comes to comprehension tasks like sentiment classification. We have seen that autoregressive models like GPT, GPT-2, and Reformer perform poorly and fail to achieve high accuracy. Nevertheless, XLNet reached good results even though it is an autoregressive model, because it incorporates ideas taken from encoders, characterized by their bidirectional property. Indeed, all performances were close, including that of DistilBERT, which achieves remarkable performance in less training time thanks to knowledge distillation. For example, for 4 epochs, BERT took 70 minutes to train while DistilBERT took 35 minutes, losing only 0.6 F1 points but saving half the time taken by BERT. Moreover, our ablation study shows that the maximum sequence length is one of the parameters with a significant impact on the final results and must be carefully analyzed and adjusted. Likewise, data quality is a must for good performance, and data should ideally not need further processing, since extensive data cleaning may prevent the model from capturing local and global contexts in sequences from which words have been removed or trimmed during cleaning. Besides, we notice that the majority of the models we fine-tuned on the IMDb dataset start to overfit after a certain number of epochs, which can lead to biased models. However, good quality data alone is not enough; pre-training a model on large amounts of business-problem data and vocabulary may help prevent it from making wrong predictions and may help it reach a high level of generalization.
ACKNOWLEDGMENTS
We are grateful to the Hugging Face team for their role in democratizing state-of-the-art machine
learning and natural language processing technology through open-source tools. Their commitment to pro-
viding valuable resources to the research community is highly appreciated, and we acknowledge their vital
contribution to the development of our article.
APPENDIX
Appendix for "Analysis of the evolution of advanced transformer-based language models: experiments on opinion mining".
Table A1. Summary and comparison of transformer-based models (L: number of transformer layers, H: hidden size, A: number of attention heads)

GPT: L=12, H=512, A=12; attention type: global; total params: 110M; tokenization: byte-pair encoding [32]; pre-training data: Books Corpus (800M words); computational cost: not reported; training objectives: autoregressive, decoder; strong tasks: zero-shot, text summarization, question answering, translation; key points: the first transformer-based autoregressive and causal masking model.

BERT: L=12, H=768, A=12; attention type: global; total params: 110M; tokenization: WordPiece [30]; pre-training data: Books Corpus (800M words) and English Wikipedia (2,500M words); computational cost: 4 days on 4 Cloud TPUs in Pod configuration; training objectives: autoencoding, encoder (MLM, NSP); strong tasks: text classification, natural language inference, question answering; key points: the first transformer-based autoencoding model, which uses global attention to provide high-level bidirectional contextualization.

GPT-2: L=12, H=1600, A=12; attention type: global; total params: 117M; tokenization: byte-pair encoding; pre-training data: WebText (10B words); computational cost: not reported; training objectives: autoregressive, decoder; strong tasks: zero-shot, text summarization, question answering, translation; key points: optimized and bigger than GPT, and performs well in zero-shot settings.

GPT-3: L=96, H=12288, A=96; attention type: global; total params: 175B; tokenization: byte-pair encoding; pre-training data: filtered Common Crawl, WebText2, Books1, Books2, and Wikipedia, for 300B words; computational cost: not reported; training objectives: autoregressive, decoder; strong tasks: text summarization, question answering, translation, zero-shot, one-shot, few-shot; key points: bigger than its predecessors.

ALBERT: L=12, H=768, A=12; attention type: global; total params: 11M; tokenization: SentencePiece [31]; pre-training data: Books Corpus [35] and English Wikipedia; computational cost: Cloud TPU V3, with the number of TPUs ranging from 64 to 512 (32h for ALBERT-xxlarge); training objectives: autoencoding, encoder, sentence-order prediction (SOP); strong tasks: semantic similarity, semantic relevance, question answering, reading comprehension; key points: smaller than and similar to BERT, with minimal tweaks including splitting layers into groups via cross-layer parameter sharing, making it faster and reducing its memory footprint.

DistilBERT: L=6, H=768, A=12; attention type: global; total params: 66M; tokenization: WordPiece; pre-training data: English Wikipedia and Toronto Book Corpus; computational cost: 90 hours on 8 16GB V100 GPUs; training objectives: autoencoding (MLM), encoder; strong tasks: semantic similarity, semantic relevance, question answering, textual entailment; key points: pre-training leverages knowledge distillation to deliver results comparable to BERT with lower latency; similar to BERT but smaller.

RoBERTa: L=12, H=1024, A=12; attention type: global; total params: 125M; tokenization: byte-pair encoding; pre-training data: Book Corpus [35], CC-News, OpenWebText, and Stories [36]; computational cost: 8 32GB Nvidia V100 GPUs; training objectives: autoencoding (dynamic MLM, no NSP), encoder; strong tasks: text classification, language inference, question answering; key points: pre-trained with large batches using tricks for diverse learning such as dynamic masking, where tokens are masked differently at each epoch.

XLM: L=12, H=2048, A=8; attention type: global; total params: not reported; tokenization: byte-pair encoding; pre-training data: Wikipedias of the XNLI languages; computational cost: 64 Volta GPUs for the language modeling tasks and 8 GPUs for the MT tasks; training objectives: autoencoding, encoder, causal language modeling (CLM), masked language modeling (MLM), and translation language modeling (TLM); strong tasks: translation tasks and NLU cross-lingual benchmarks; key points: by being trained on several pre-training objectives on a multilingual corpus, XLM proves that multilingual pre-training methods have a strong impact, especially on the performance of multilingual tasks.

XLM-RoBERTa: L=12, H=768, A=8; attention type: global; total params: 270M; tokenization: SentencePiece; pre-training data: CommonCrawl corpus in 100 languages; computational cost: 100 32GB Nvidia V100 GPUs; training objectives: autoencoding, encoder, MLM; strong tasks: translation tasks and NLU cross-lingual benchmarks; key points: using only the masked language modeling objective, XLM-RoBERTa applies RoBERTa tricks to XLM approaches; it is able to detect the input language by itself (100 languages).

ELECTRA: L=12, H=768, A=12; attention type: global; total params: 110M; tokenization: WordPiece; pre-training data: Wikipedia, BooksCorpus, Gigas5 [37], ClueWeb 2012-B, and Common Crawl; computational cost: 4 days on 1 GPU; training objectives: generator (autoregressive, replaced token detection) and discriminator (predicting masked tokens); strong tasks: sentiment analysis, language inference tasks; key points: replaced token detection is a pre-training objective that addresses MLM issues and results in efficient performance.

DeBERTa: L=12, H=768, A=12; attention type: global (disentangled attention); total params: 125M; tokenization: byte-pair encoding; pre-training data: Wikipedia, BooksCorpus, Reddit content, Stories, STORIES; computational cost: 10 days on 64 V100 GPUs; training objectives: autoencoding, disentangled attention mechanism, and enhanced mask decoder; strong tasks: DeBERTa was the first pre-trained model to beat HLP on the SuperGLUE benchmark [38]; key points: DeBERTa uses RoBERTa with disentangled attention and an enhanced mask decoder to significantly improve model performance on many downstream tasks while being trained on only half of the data used for the RoBERTa large version.

XLNet: L=12, H=768, A=12; attention type: global; total params: 110M; tokenization: SentencePiece; pre-training data: Wikipedia, BooksCorpus, Gigas5 [37], ClueWeb 2012-B, and Common Crawl; computational cost: 5.5 days on 512 TPU v3 chips; training objectives: autoregressive, decoder; strong tasks: XLNet achieved state-of-the-art results and outperformed BERT on 20 downstream tasks including sentiment analysis, question answering, reading comprehension, and document ranking; key points: XLNet incorporates ideas from transformer-XL [17] and addresses BERT's pretrain-finetune discrepancy, being more capable of grasping dependencies between masked tokens.

BART: L=12, H=768, A=16; attention type: global; total params: 139M; tokenization: byte-pair encoding; pre-training data: Wikipedia, BooksCorpus; computational cost: not reported; training objectives: generative sequence-to-sequence, encoder-decoder, token masking, token deletion, text infilling, sentence permutation, and document rotation; strong tasks: BART beats its predecessors on generation tasks such as translation and achieved state-of-the-art results, while performing similarly to RoBERTa on discriminative tasks including question answering and classification; key points: trained to map corrupted text back to the original using an arbitrary noising function.

ConvBERT: L=12, H=768, A=12; attention type: global; total params: 124M; tokenization: WordPiece; pre-training data: OpenWebText [39]; computational cost: GPU and TPU; training objectives: autoencoding, encoder; strong tasks: with fewer parameters and lower costs, ConvBERT consistently outperforms BERT on various downstream tasks with less training cost; key points: for reduced redundancy and better modeling of global and local context, BERT's self-attention blocks are replaced by mixed-attention blocks incorporating self-attention and span-based dynamic convolutions.

Reformer: L=12, H=1024, A=8; attention type: attention with locality-sensitive hashing; total params: 149M; tokenization: SentencePiece; pre-training data: OpenWebText [39]; computational cost: parallelization across 8 GPUs or 8 TPU v3 cores; training objectives: autoregressive, decoder; strong tasks: performs well under pragmatic requirements, thanks to the reduction of the attention complexity; key points: an efficient and faster transformer that costs less time on long sequences thanks to two optimization techniques, locality-sensitive hashing attention and axial position encoding.

T5: L=12, H=768, A=12; attention type: global; total params: 220M; tokenization: SentencePiece; pre-training data: the Colossal Clean Crawled Corpus (C4); computational cost: Cloud TPU Pods; training objectives: generative sequence-to-sequence, encoder-decoder; strong tasks: entailment, coreference challenges, and question answering tasks via the SuperGLUE benchmark; key points: to cover the varieties of most linguistic tasks, T5 is pre-trained on a mix of supervised and unsupervised tasks in a text-to-text format.

Longformer: L=12, H=768, A=12; attention type: local + global; total params: 149M; tokenization: byte-pair encoding; pre-training data: Books corpus, English Wikipedia, and the Realnews dataset [40]; computational cost: not reported; training objectives: autoregressive, decoder; strong tasks: Longformer achieved state-of-the-art results on the WikiHop and TriviaQA benchmark datasets; key points: for higher training efficiency on long documents, Longformer uses sparse matrices instead of full attention matrices to scale linearly with sequences of up to 4,096 tokens.
REFERENCES
[1] K. R. Chowdhary, “Natural language processing,” in Fundamentals of artificial intelligence, 2020, pp. 603–649, doi: 10.1007/978-
81-322-3972-7 19.
[2] M. Rhanoui, M. Mikram, S. Yousfi, A. Kasmi, and N. Zoubeidi, “A hybrid recommender system for patron driven library acquisition
and weeding,” in Journal of King Saud University-Computer and Information Sciences, 2020, vol. 34, no. 6, Part A, pp. 2809–2819,
doi: 10.1016/j.jksuci.2020.10.017.
[3] F. Z. Trabelsi, A. Khtira, and B. El Asri, “Hybrid recommendation systems: a state of art.,” in Proceedings of the 16th
International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE), 2021, pp. 281–288, doi:
10.5220/0010452202810288.
[4] B. Pandey, D. K. Pandey, B. P. Mishra, and W. Rhmann, “A comprehensive survey of deep learning in the field of medical imaging
and medical natural language processing: challenges and research directions,” in Journal of King Saud University-Computer and
Information Sciences, 2021, vol. 34, no. 8, Part A, pp. 5083–5099, doi: 10.1016/j.jksuci.2021.01.007.
[5] A. Harnoune, M. Rhanoui, M. Mikram, S. Yousfi, Z. Elkaimbillah, and B. El Asri, “BERT based clinical knowledge extraction for
biomedical knowledge graph construction and analysis,” in Computer Methods and Programs in Biomedicine Update, 2021, vol. 1,
p. 100042, doi: 10.1016/j.cmpbup.2021.100042.
[6] W. Medhat, A. Hassan, and H. Korashy, “Sentiment analysis algorithms and applications: a survey,” in Ain Shams engineering
journal, 2014, vol. 5, no. 4, pp. 1093–1113, doi: 10.1016/j.asej.2014.04.011.
[7] S. Sun, C. Luo, and J. Chen, “A review of natural language processing techniques for opinion mining systems,” in Information
fusion, 2017, vol. 36, pp. 10–25, doi: 10.1016/j.inffus.2016.10.004.
[8] S. Yousfi, M. Rhanoui, and D. Chiadmi, “Mixed-profiling recommender systems for big data environment,” in First International
Conference on Real Time Intelligent Systems, 2017, pp. 79–89, doi: 10.1007/978-3-319-91337-7 8.
[9] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understand-
ing,” in NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies - Proceedings of the Conference, 2019, vol. 1, pp. 4171–4186, doi: 10.18653/v1/N19-1423.
[10] A. Vaswani et al., “Attention is all you need,” in Proceedings of the 31st Conference on Neural Information Processing Systems,
Dec. 2017, pp. 5998–6008, doi: 10.48550/arXiv.1706.03762.
[11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding with unsupervised learning,” Pro-
ceedings of the 2018 Conference on Neural Information Processing Systems, 2018.
[12] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI
blog, vol. 1, no. 8, p. 9, 2019.
[13] T. B. Brown et al., “Language models are few-shot learners,” in Proceedings of the 34th International Conference on Neural
Information Processing Systems, 2020, vol. 33, pp. 1877–1901, doi: 10.48550/arXiv.2005.14165.
[14] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language
representations,” International Conference on Learning Representations, 2019, doi: 10.48550/arXiv.1909.11942.
[15] Y. Liu et al., “RoBERTa: a robustly optimized BERT pretraining approach,” arXiv:1907.11692, 2019, doi:
10.48550/arXiv.1907.11692.
[16] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: generalized autoregressive pretraining for language understanding,” in Advances in neural information processing systems, 2019, pp. 5753–5763, doi: 10.48550/arXiv.1906.08237.
[17] Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-XL: attentive language models beyond a fixed-length
context,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Jul. 2019, pp. 2978–2988,
doi: 10.18653/v1/P19-1285.
[18] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” in arXiv preprint arXiv:1910.01108, 2019, doi: 10.48550/arXiv.1910.01108.
[19] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in arXiv preprint arXiv:1503.02531, vol. 2,
no. 7, Mar. 2015, doi: 10.48550/arXiv.1503.02531.
[20] G. Lample and A. Conneau, “Cross-lingual language model pretraining,” arXiv:1901.07291, 2019, doi:
10.48550/arXiv.1901.07291.
[21] A. Conneau et al., “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, 2020, pp. 8440–8451, doi: 10.18653/v1/2020.acl-main.747.
[22] M. Lewis et al., “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880, doi: 10.18653/v1/2020.acl-main.703.
[23] Z. Jiang, W. Yu, D. Zhou, Y. Chen, J. Feng, and S. Yan, “ConvBERT: Improving BERT with span-based dynamic convo-
lution,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, p. 12, doi:
10.48550/arXiv.2008.02496.
[24] N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: the efficient transformer,” arXiv:2001.04451, 2020, doi:
10.48550/arXiv.2001.04451.
[25] C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” in Journal of Machine Learning Research, 2020, vol. 21, no. 140, pp. 1–67, doi: 10.48550/arXiv.1910.10683.
[26] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “ELECTRA: pre-training text encoders as discriminators rather than generators,” arXiv:2003.10555, p. 18, 2020, doi: 10.48550/arXiv.2003.10555.
[27] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: the long-document transformer,” arXiv:2004.05150, 2020, doi:
10.48550/arXiv.2004.05150.
[28] P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: decoding-enhanced BERT with disentangled attention,” arXiv:2006.03654, 2020,
doi: 10.48550/arXiv.2006.03654.
[29] S. Singh and A. Mahmood, “The NLP cookbook: modern recipes for transformer based deep learning architectures,” in IEEE
Access, 2021, vol. 9, pp. 68675–68702, doi: 10.1109/access.2021.3077350.
[30] Y. Wu et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” in arXiv
preprint arXiv:1609.08144, 2016, doi: 10.48550/arXiv.1609.08144.
[31] T. Kudo and J. Richardson, “SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71, doi: 10.18653/v1/D18-2012.
[32] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, vol. 1, pp. 1715–1725, doi:
10.18653/v1/P16-1162.
[33] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings
of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Jun. 2011, vol. 1,
pp. 142–150.
[34] N. E. Zekaoui, “Opinion transformers.” 2023, [Online]. Available: https://github.com/zekaouinoureddine/Opinion-Transformers
(Accessed Jan. 2, 2023).
[35] Y. Zhu et al., “Aligning books and movies: towards story-like visual explanations by watching movies and reading books,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 19–27, doi: 10.1109/ICCV.2015.11.
[36] T. H. Trinh and Q. V. Le, “A simple method for commonsense reasoning,” in arXiv:1806.02847, 2018, doi: 10.48550/arXiv.1806.02847.
[37] R. Parker, D. Graff, J. Kong, K. Chen, and K. Maeda, “English gigaword fifth edition, linguistic data consortium,” 2011, doi:
10.35111/wk4f-qt80.
[38] A. Wang et al., “SuperGLUE: A stickier benchmark for general-purpose language understanding systems,” in Advances in neural
information processing systems, 2019, vol. 32, doi: 10.48550/arXiv.1905.00537.
[39] A. Gokaslan and V. Cohen, “OpenWebText Corpus,” 2019, [Online]. Available: http://skylion007.github.io/OpenWebTextCorpus (Accessed Jan. 2, 2023).
[40] R. Zellers et al., “Defending against neural fake news,” Advances in Neural Information Processing Systems, vol. 32, p. 12, 2019,
doi: 10.48550/arXiv.1905.12616.
BIOGRAPHIES OF AUTHORS
Nour Eddine Zekaoui holds an Engineering degree in Knowledge and Data Engineering from the School of Information Sciences, Morocco (2021). He is currently a Machine Learning Engineer at a tech company. His research focuses on natural language processing and artificial intelligence, including information retrieval, question answering, semantic similarity, and bioinformatics. He can be contacted at email: noureddinezekaoui@gmail.com or noureddine.zekaoui@esi.ac.ma.
Siham Yousfi has been a Professor of Computer Sciences and Big Data at the School of Information Sciences, Rabat, since 2011. She holds a PhD from the Mohammadia School of Engineering, Mohammed V University, Rabat (2019). Her research interests include big data, natural language processing, and artificial intelligence. She can be contacted at email: syousfi@esi.ac.ma.
Maryem Rhanoui is an Associate Professor of Computer Sciences and Data Engineering at the School of Information Sciences, Rabat. She received an engineering degree in computer science and then a PhD degree from ENSIAS, Mohammed V University, Rabat (2015). Her research interests include pattern recognition, computer vision, cybersecurity, and medical data analysis. She can be contacted at email: mrhanoui@esi.ac.ma.
Mounia Mikram has been an Associate Professor of Computer Sciences and Mathematics at the School of Information Sciences, Rabat, since 2010. She received her master's degree from Mohammed V University, Rabat (2003) and her PhD degree from Mohammed V University, Rabat, and Bordeaux I University (2008). Her research interests include pattern recognition, computer vision, biometric security systems, and artificial intelligence. She can be contacted at email: mmikram@esi.ac.ma.