The document summarizes research on classifying texts using neural networks with different text representation models. It explores using a bag-of-words model with a fully connected neural network and using the word2vec model with a convolutional neural network. The research tested these approaches on a dataset of news articles across 20 categories, finding the word2vec/CNN approach produced more semantically relevant results while also learning a compact text representation.
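The bag-of-words representation mentioned above can be sketched in a few lines. This is a minimal, hand-rolled count vectorizer for illustration only; the vocabulary and documents are invented stand-ins, not the 20-category news dataset itself.

```python
# A minimal bag-of-words encoding, the simpler of the two text
# representations compared above. Vocabulary and documents are
# illustrative stand-ins.

def build_vocab(docs):
    """Map each distinct token to a column index."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def bow_vector(doc, vocab):
    """Count-based vector; one slot per vocabulary word."""
    vec = [0] * len(vocab)
    for tok in doc.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

docs = ["sports team wins game", "market prices rise", "team loses game"]
vocab = build_vocab(docs)
print(bow_vector("game game team", vocab))
```

Vectors like these feed a fully connected network directly; the word2vec/CNN alternative instead gives each word a dense vector and lets convolutions pick up local word-order patterns that this order-free encoding discards.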
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS (ijseajournal)
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data. However, it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t-SNE to develop a context-aware clustering approach (using pre-trained word embeddings). We encode words or categorical data into numerical, context-aware vectors that we use to cluster the data points using common clustering algorithms like K-means.
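The pipeline described above can be sketched as: look up a vector for each categorical value, then cluster the vectors. The embeddings below are tiny hand-made stand-ins for real pre-trained GloVe vectors, and the K-means here is a bare-bones version of what a library call would normally provide.

```python
# Sketch: categorical values -> embedding vectors -> K-means clusters.
# The 2-d "embeddings" are hypothetical stand-ins for GloVe vectors.
import math, random

embeddings = {
    "red":  [0.9, 0.1], "blue": [0.8, 0.2], "green": [0.85, 0.15],
    "dog":  [0.1, 0.9], "cat":  [0.2, 0.8], "horse": [0.15, 0.85],
}

def kmeans(points, k, iters=20):
    """Bare-bones Lloyd's algorithm; a library KMeans would replace this."""
    random.seed(0)
    centers = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[i].append(p)
        new_centers = []
        for i, g in enumerate(groups):
            if g:
                new_centers.append([sum(vals) / len(g) for vals in zip(*g)])
            else:
                new_centers.append(centers[i])  # keep empty cluster's center
        centers = new_centers
    return centers

words = list(embeddings)
centers = kmeans([embeddings[w] for w in words], k=2)
labels = {w: min(range(2), key=lambda c: math.dist(embeddings[w], centers[c]))
          for w in words}
print(labels)
```

With context-aware vectors, values that co-occur in similar contexts (here, the colours versus the animals) land in the same cluster even though the raw categorical labels share nothing.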
Effect of word embedding vector dimensionality on sentiment analysis through ... (IAESIJAI)
Word embedding has become the most popular method of lexical description
in a given context in the natural language processing domain, especially
through the word to vector (Word2Vec) and global vectors (GloVe)
implementations. Since GloVe is a pre-trained model that provides access to
word mapping vectors on many dimensionalities, a large number of
applications rely on its prowess, especially in the field of sentiment analysis.
However, in the literature, we found that in many cases, GloVe is
implemented with arbitrary dimensionalities (often 300d) regardless of the
length of the text to be analyzed. In this work, we conducted a study that
identifies the effect of the dimensionality of word embedding mapping
vectors on short and long texts in a sentiment analysis context. The results
suggest that as the dimensionality of the vectors increases, the performance
metrics of the model also increase for long texts. In contrast, for short texts,
we recorded a threshold at which dimensionality does not matter.
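The experimental variable above, vector dimensionality, can be illustrated crudely by keeping only the first d components of a vector. Note this is only an illustration: GloVe's 50d/100d/200d/300d vectors are separately trained files, not truncations of one another, and the vectors below are invented.

```python
# Illustrative only: vary embedding dimensionality d by truncation
# and observe its effect on a similarity score. Real GloVe models are
# trained per dimensionality rather than truncated.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def truncate(vec, d):
    """Keep only the first d components, as a crude dimensionality knob."""
    return vec[:d]

good = [0.9, 0.8, 0.1, 0.2]   # hypothetical vector for "good"
great = [0.85, 0.75, 0.15, 0.25]  # hypothetical vector for "great"
for d in (2, 4):
    print(d, round(cosine(truncate(good, d), truncate(great, d)), 3))
```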
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES (kevig)
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets of word-pair similarities. The research in this paper conducts a correlation analysis between ground-truth word similarities and similarities obtained by different word embedding methods.
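The correlation analysis described above can be shown in miniature: compare human similarity judgements for word pairs against model cosine similarities using Spearman rank correlation. The scores below are invented for illustration; benchmarks such as WordSim-353 supply real ones.

```python
# Intrinsic evaluation sketch: Spearman correlation between human
# word-pair similarity scores and model similarity scores.
# Scores are invented; a real benchmark dataset would supply them.

def rank(xs):
    """1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(xs, ys):
    """Spearman rho via the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(xs), rank(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [9.0, 7.5, 3.0, 1.0]   # judged similarity for four word pairs
model = [0.8, 0.9, 0.3, 0.1]   # cosine similarity from an embedding
print(spearman(human, model))
```

A rho near 1 means the embedding ranks word pairs much as humans do, which is exactly what the benchmark comparison measures.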
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS (ijdms)
In this paper, we describe a Convolutional Neural Network (CNN) model developed for the classification of advertisements. The method has been tested on texts in two languages (Arabic and Slovak). The advertisements were taken from classified-advertisement websites as short texts. We built a modified CNN model, implemented it, and developed further modifications, studying their influence on the performance of the proposed network. The result is a functional model of the network, its implementation in Java and Python, and an analysis of the model's results using different parameters for the network and input data. The experimental results show that the developed CNN model is useful in the domains of Arabic and Slovak short texts, mainly for the classification of advertisements.
Sentiment analysis is context-based mining of text, which extracts and identifies subjective information from a given text or sentence. Here the main concept is extracting the sentiment of the text using machine learning techniques such as LSTM (Long Short-Term Memory). This text classification method analyses the incoming text and determines whether the underlying emotion is positive or negative, along with a probability associated with that judgement. The probability depicts the strength of a positive or negative statement: if the probability is close to 0, the sentiment is strongly negative, and if it is close to 1, the statement is strongly positive. A web application is created to deploy this model using a Python-based micro framework called Flask. Many other methods, such as RNN and CNN, are inefficient when compared to LSTM. Dirash A R | Dr. S K Manju Bargavi, "LSTM Based Sentiment Analysis", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42345.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-processing/42345/lstm-based-sentiment-analysis/dirash-a-r
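The probability-to-sentiment reading described above can be written as a small helper: values near 0 mean strongly negative, values near 1 strongly positive. The thresholds below are illustrative choices, not values from the paper, and the probability itself would come from the trained LSTM.

```python
# Interpreting a sentiment model's output probability.
# Threshold 0.8 for "strongly" is an illustrative assumption.

def interpret(prob):
    label = "positive" if prob >= 0.5 else "negative"
    strength = abs(prob - 0.5) * 2   # 0 = uncertain, 1 = maximally strong
    degree = "strongly" if strength > 0.8 else "mildly"
    return f"{degree} {label}"

print(interpret(0.97))
print(interpret(0.42))
```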
Different Similarity Measures for Text Classification Using Knn (IOSR Journals)
This document summarizes research on classifying textual data using the k-nearest neighbors (KNN) algorithm and different similarity measures. It explores generating 9 different vector representations of text documents and using KNN with similarity measures like Euclidean, Manhattan, squared Euclidean, etc. to classify documents. The researchers tested KNN on a Reuters news corpus with 5,485 training documents across 8 classes and found that normalization and k=4 produced the best accuracy of 94.47%. They conclude KNN with different similarity measures and vector representations is effective for multi-class text classification.
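The classifier studied above reduces to a compact loop: score the query document's distance to every training document under a pluggable measure, then take a majority vote among the k nearest. The tiny vectors below stand in for real document representations.

```python
# k-nearest-neighbour text classification with a pluggable
# distance measure (Euclidean or Manhattan here, as in the study).
import math
from collections import Counter

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def knn_predict(query, train, k=3, dist=euclidean):
    """train is a list of (vector, label) pairs; majority vote of k nearest."""
    nearest = sorted(train, key=lambda ex: dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([1, 0, 0], "sport"), ([0.9, 0.1, 0], "sport"),
         ([0, 1, 1], "tech"),  ([0.1, 0.9, 1], "tech")]
print(knn_predict([0.95, 0.05, 0], train, k=3))
```

Swapping `dist=manhattan` (or any other measure) changes only the ranking step, which is why the study can compare many similarity measures over the same vector representations.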
The document describes a comparative study of various machine learning and neural network models for detecting abusive language on Twitter. It finds that a bidirectional GRU network trained on word-level features, with a Latent Topic Clustering module, achieves the most accurate results with an F1 score of 0.805 for detecting abusive tweets. Additionally, it explores using context tweets as additional features and finds this improves some models' performance.
Neural Models for Information Retrieval (Bhaskar Mitra)
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embeddings spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
Indexing for Large DNA Database Sequences (CSCJournals)
Bioinformatics data consists of a huge amount of information due to the large number of sequences, the very high sequence lengths, and the daily additions of new data. This data needs to be accessed efficiently for many purposes. What makes one DNA data item distinct from another is its DNA sequence, which consists of combinations of the four characters A, C, G, and T and varies in length. Using a suitable representation of DNA sequences, and a suitable index structure to hold this representation in main memory, leads to efficient processing by accessing the DNA sequences through the index and reduces the number of disk I/O accesses. To avoid false hits, we reduce the number of candidate DNA sequences that need to be checked by pruning, so there is no need to search the whole database. We need a suitable index for searching DNA sequences efficiently, with suitable index size and searching time. The selection of the relation fields on which the index is built has a big effect on index size and search time. Our experiments use the n-gram wavelet transformation upon single-field and multi-field index structures under a relational DBMS environment. Results show the need to consider index size and search time carefully when using indexing. Increasing the window size decreases the amount of I/O references. The behaviour of single-field and multi-field indexing is highly affected by the window size value: increasing it leads to better searching time with a special index type under single-field indexing, while search time is almost equally good across most index types under multi-field indexing. The storage space needed for RDBMS index types is almost the same as, or greater than, the actual data.
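The pruning idea above can be sketched as an inverted index over n-grams (windows): only sequences sharing an n-gram with the query become candidates for exact checking, so most of the database is never scanned. The sequences and window size are toy examples; the paper's wavelet transformation step is omitted.

```python
# Candidate pruning via an n-gram inverted index over DNA sequences.
# Toy data; the window size n corresponds to the paper's window size.
from collections import defaultdict

def ngrams(seq, n):
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def build_index(db, n):
    """Inverted index: n-gram -> ids of sequences containing it."""
    index = defaultdict(set)
    for sid, seq in db.items():
        for g in ngrams(seq, n):
            index[g].add(sid)
    return index

def candidates(query, index, n):
    """Union of sequence ids sharing any n-gram with the query;
    only these need an exact check, which avoids false hits."""
    hits = set()
    for g in ngrams(query, n):
        hits |= index.get(g, set())
    return hits

db = {1: "ACGTAC", 2: "GGGTTT", 3: "TACGTA"}
idx = build_index(db, n=3)
print(candidates("ACGT", idx, n=3))
```

A larger window size produces rarer n-grams, hence fewer candidates and fewer I/O references, matching the trend reported above.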
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION (IJDKP)
This article introduces some approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very similar in content and related to the telecommunications, Internet, and computer areas were selected for the model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
Analysis of Opinionated Text for Opinion Mining (mlaij)
In sentiment analysis, the polarities of the opinions expressed on an object/feature are determined to assess the sentiment of a sentence or document whether it is positive/negative/neutral. Naturally, the object/feature is a noun representation which refers to a product or a component of a product, let’s say, the "lens" in a camera and opinions emanating on it are captured in adjectives, verbs, adverbs and noun words themselves. Apart from such words, other meta-information and diverse effective features are also going to play an important role in influencing the sentiment polarity and contribute significantly to the performance of the system. In this paper, some of the associated information/meta-data are explored and investigated in the sentiment text. Based on the analysis results presented here, there is scope for further assessment and utilization of the meta-information as features in text categorization, ranking text document, identification of spam documents and polarity classification problems.
Our project is about guessing the correct missing word in a given sentence. To guess the missing word we have two main methods: statistical language modeling and neural language models. Statistical language modeling depends on the frequency of the relations between words, and here we use a Markov chain. Neural language models use artificial neural networks with deep learning; here we use BERT, the state of the art in language modeling, provided by Google.
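The statistical route described above can be shown in miniature: a bigram Markov model that guesses a missing word from the word before the gap. The corpus is a toy stand-in; BERT, by contrast, conditions on both sides of the gap at once.

```python
# Bigram Markov-chain sketch of missing-word prediction.
# The training "corpus" is a toy example.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count word -> next-word frequencies (the Markov transition counts).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def guess_missing(prev_word):
    """Most frequent follower of prev_word in the training text."""
    followers = bigrams[prev_word]
    return followers.most_common(1)[0][0] if followers else None

print(guess_missing("the"))
```

The weakness this exposes is exactly the motivation for the neural approach: a bigram model sees only one preceding word, so "the ___ sat on the mat" and "the ___ rose sharply" get the same guess.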
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK (ijnlc)
In recent years, there has been an increasing use of social media among people in Myanmar, and writing reviews on social media pages about products, movies, and trips has also become popular. Moreover, most people look for review pages about a product before deciding whether to buy it. Extracting useful reviews for products of interest is very important and time-consuming for people. Sentiment analysis is one of the important processes for extracting useful reviews of products. In this paper, a Convolutional LSTM neural network architecture is proposed to analyse the sentiment classification of cosmetic reviews written in the Myanmar language. The paper also intends to build a cosmetic reviews dataset for deep learning and a sentiment lexicon in the Myanmar language.
Fine grained irony classification through transfer learning approach (CSIT, iaesprime)
Nowadays irony appears to be pervasive in all social media discussion forums and chats, posing further obstacles to sentiment analysis efforts. The aim of the present research work is to detect irony and its types in English tweets. We employed a new system for irony detection in English tweets: we propose a distilled bidirectional encoder representations from transformers (DistilBERT) light transformer model based on the bidirectional encoder representations from transformers (BERT) architecture, further strengthened by the use of a bidirectional long short-term memory (Bi-LSTM) network; this configuration minimizes data preprocessing tasks. The proposed model was tested on SemEval-2018 Task 3, for which 3,834 samples were provided. Experiment results show the proposed system achieved a precision of 81% for the not-irony class and 66% for the irony class, recall of 77% for not-irony and 72% for irony, and F1 scores of 79% for not-irony and 69% for irony. Since previous researchers have come up with binary classification models, in this study we extended our work to multiclass classification of irony. This is significant and will serve as a foundation for future research on the different types of irony in tweets.
Concurrent Inference of Topic Models and Distributed Vector Representations (Parang Saraf)
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
This document presents a scalable method for image classification using sparse coding and dictionary learning. It proposes parallelizing the computation of image similarity for faster recognition. Specifically, it distributes the task of measuring similarity between images among multiple cores in a cluster. Experimental results on a face recognition dataset show nearly linear speedup when balancing the dataset size and number of nodes. Reconstruction errors are used as a similarity measure, with dictionaries learned using K-SVD for each image. The proposed parallel method distributes this similarity computation process to achieve faster image classification.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the amount of available research documents impels researchers to propose topic models that extract the latent semantic themes of a documents collection. However, how to extract the hidden topics of the documents collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from a scalability problem as the size of the documents collection increases. In this paper, the Correlated Topic Model with the variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach utilizes a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. From the evaluation, the proposed approach has comparable performance, in terms of topic coherence, with LDA implemented in the MapReduce framework.
An in-depth review on News Classification through NLP (IRJET Journal)
This document provides an in-depth literature review of news classification through natural language processing (NLP). It discusses several existing approaches to news classification, including models that use convolutional neural networks (CNNs), graph-based approaches, and attention mechanisms. The document also notes that current search engines often return too many irrelevant results, so classification could help layer search results. It concludes that while many techniques have been developed, inconsistencies remain in effectively classifying news, so further research on combining NLP, feature extraction, and fuzzy logic is needed.
Automatic Grading of Handwritten Answers (IRJET Journal)
The document describes a proposed system for automatically grading handwritten exam answers. It uses optical character recognition to extract text from scanned answer sheets. Natural language processing techniques like BERT embeddings and Word Mover's Distance are used to compare the student's answer against reference answers from teachers. The system aims to grade papers quickly and accurately to reduce workload for teachers while still providing detailed performance assessments for students. It was developed to address the need for online and at-scale exam grading given limitations of traditional in-person exams.
Optimizer algorithms and convolutional neural networks for text classification (IAESIJAI)
Lately, deep learning has improved the algorithms and the architectures of several natural language processing (NLP) tasks. In spite of that, the performance of any deep learning model is widely impacted by the used optimizer algorithm; which allows updating the model parameters, finding the optimal weights, and minimizing the value of the loss function. Thus, this paper proposes a new convolutional neural network (CNN) architecture for text classification (TC) and sentiment analysis and uses it with various optimizer algorithms in the literature. Actually, in NLP, and particularly for sentiment classification concerns, the need for more empirical experiments increases the probability of selecting the pertinent optimizer. Hence, we have evaluated various optimizers on three types of text review datasets: small, medium, and large. Thereby, we examined the optimizers regarding the data amount and we have implemented our CNN model on three different sentiment analysis datasets so as to binary label text reviews. The experimental results illustrate that the adaptive optimization algorithms Adam and root mean square propagation (RMSprop) have surpassed the other optimizers. Moreover, our best CNN model which employed the RMSprop optimizer has achieved 90.48% accuracy and surpassed the state-of-the-art CNN models for binary sentiment classification problems.
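The RMSprop update singled out above can be written out in plain Python: each parameter keeps a running average of its squared gradients and is scaled by it, which is what makes the optimizer "adaptive". The constants are the usual textbook defaults, not the paper's tuned values, and the quadratic toy objective is only for demonstration.

```python
# One-parameter-list RMSprop step, textbook form:
#   cache = decay*cache + (1-decay)*grad^2
#   w     = w - lr * grad / (sqrt(cache) + eps)
import math

def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    new_w, new_cache = [], []
    for wi, gi, ci in zip(w, grad, cache):
        ci = decay * ci + (1 - decay) * gi * gi
        wi = wi - lr * gi / (math.sqrt(ci) + eps)
        new_w.append(wi)
        new_cache.append(ci)
    return new_w, new_cache

# Minimise f(w) = w0^2 + w1^2 from (1, -1); gradient is (2*w0, 2*w1).
w, cache = [1.0, -1.0], [0.0, 0.0]
for _ in range(200):
    grad = [2 * wi for wi in w]
    w, cache = rmsprop_step(w, grad, cache)
print([round(wi, 3) for wi in w])
```

Because the per-parameter cache normalises the gradient scale, the same learning rate works across parameters with very different gradient magnitudes, which is the property the comparison above credits Adam and RMSprop for.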
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS - IJDKP
Big Data creates many challenges for data mining experts, in particular in getting meanings of text data. It is beneficial for text mining to build a bridge between word embedding process and graph capacity to connect the dots and represent complex correlations between entities. In this study we examine processes of building a semantic graph model to determine word associations and discover document topics. We introduce a novel Word2Vec2Graph model that is built on top of Word2Vec word embedding model. We demonstrate how this model can be used to analyze long documents, get unexpected word associations and uncover document topics. To validate topic discovery method we transfer words to vectors and vectors to images and use CNN deep learning image classification.
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING - IJCI JOURNAL
This document summarizes a research paper that proposes using a combination of Natural Language Processing and statistical models to match features between different datasets. Specifically, it uses BERT (Bidirectional Encoder Representations from Transformers), a pretrained NLP model, in parallel with Jaccard similarity to measure similarity between feature lists. The hybrid approach reduces time required for manual feature matching compared to previous methods. The paper describes preprocessing data, generating embeddings with BERT, calculating similarity scores with BERT and Jaccard, and outputting top matches above a threshold. It provides example results matching house sales and movie metadata features. The hybrid approach leverages strengths of BERT's semantic understanding and Jaccard's flexibility for special characters.
International Journal of Engineering Research and Development - IJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Literature Review Basics and Understanding Reference Management.pptx - Dr Ramhari Poudyal
Three-day training on academic research focuses on analytical tools at United Technical College, supported by the University Grant Commission, Nepal. 24-26 May 2024
Similar to Texts Classification with the usage of Neural Network based on the Word2vec’s Words Representation
The document describes a comparative study of various machine learning and neural network models for detecting abusive language on Twitter. It finds that a bidirectional GRU network trained on word-level features, with a Latent Topic Clustering module, achieves the most accurate results with an F1 score of 0.805 for detecting abusive tweets. Additionally, it explores using context tweets as additional features and finds this improves some models' performance.
Neural Models for Information Retrieval - Bhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embeddings spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
Indexing for Large DNA Database sequences - CSCJournals
Bioinformatics data consists of a huge amount of information due to the large number of sequences, the very long sequence lengths, and the daily additions. This data needs to be accessed efficiently for many purposes. What makes one DNA data item distinct from another is its DNA sequence, which consists of a combination of the four characters A, C, G, and T and can have different lengths. Using a suitable representation of DNA sequences, and a suitable index structure to hold this representation in main memory, enables efficient processing by accessing the DNA sequences through the index and reduces the number of disk I/O accesses. To avoid false hits, we reduce the number of candidate DNA sequences that need to be checked by pruning, so there is no need to search the whole database. We need a suitable index for searching DNA sequences efficiently, with suitable index size and search time. The selection of the relation fields on which the index is built has a big effect on index size and search time. Our experiments use the n-gram wavelet transformation with single-field and multi-field index structures in a relational DBMS environment. Results show that index size and search time must be considered carefully when using indexing. Increasing the window size decreases the number of I/O references. The effectiveness of single-field and multi-field indexing is highly affected by the window size value: a larger window size leads to better search time for a special index type using single-field indexing, while search time is almost equally good across most index types when using multi-field indexing. The storage space needed for RDBMS index types is almost the same as, or greater than, the actual data.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION - IJDKP
This article introduces some approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very similar in content and related to the telecommunications, Internet and computer areas were selected for the model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
Analysis of Opinionated Text for Opinion Mining - mlaij
In sentiment analysis, the polarities of the opinions expressed on an object/feature are determined to assess the sentiment of a sentence or document whether it is positive/negative/neutral. Naturally, the object/feature is a noun representation which refers to a product or a component of a product, let’s say, the "lens" in a camera and opinions emanating on it are captured in adjectives, verbs, adverbs and noun words themselves. Apart from such words, other meta-information and diverse effective features are also going to play an important role in influencing the sentiment polarity and contribute significantly to the performance of the system. In this paper, some of the associated information/meta-data are explored and investigated in the sentiment text. Based on the analysis results presented here, there is scope for further assessment and utilization of the meta-information as features in text categorization, ranking text document, identification of spam documents and polarity classification problems.
Our project is about guessing the correct missing word in a given sentence. To find or guess the missing word we have two main methods: one of them is statistical language modeling, while the other is neural language models. Statistical language modeling depends on the frequency of the relation between words, and here we use a Markov chain. Neural language models use artificial neural networks with deep learning; here we use BERT, which is the state of the art in language modeling provided by Google.
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK - ijnlc
In recent years, there has been an increasing use of social media among people in Myanmar and writing review on social media pages about the product, movie, and trip are also popular among people. Moreover, most of the people are going to find the review pages about the product they want to buy before deciding whether they should buy it or not. Extracting and receiving useful reviews over interesting products is very important and time consuming for people. Sentiment analysis is one of the important processes for extracting useful reviews of the products. In this paper, the Convolutional LSTM neural network architecture is proposed to analyse the sentiment classification of cosmetic reviews written in Myanmar Language. The paper also intends to build the cosmetic reviews dataset for deep learning and sentiment lexicon in Myanmar Language.
Fine grained irony classification through transfer learning approach - CSITiaesprime
Nowadays irony appears to be pervasive in all social media discussion forums and chats, posing further obstacles to sentiment analysis efforts. The aim of the present research work is to detect irony and its types in English tweets. We employed a new system for irony detection in English tweets: we propose a distilled bidirectional encoder representations from transformers (DistilBERT) light transformer model based on the bidirectional encoder representations from transformers (BERT) architecture, further strengthened by a bidirectional long short-term memory (Bi-LSTM) network; this configuration minimizes data preprocessing tasks. The proposed model was tested on SemEval-2018 Task 3, for which 3,834 samples were provided. Experiment results show the proposed system achieved a precision of 81% for the not-irony class and 66% for the irony class, recall of 77% for not-irony and 72% for irony, and F1 scores of 79% for not-irony and 69% for irony. Since previous researchers had come up with a binary classification model, in this study we extended our work to multiclass classification of irony. It is significant and will serve as a foundation for future research on different types of irony in tweets.
Concurrent Inference of Topic Models and Distributed Vector Representations - Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
This document presents a scalable method for image classification using sparse coding and dictionary learning. It proposes parallelizing the computation of image similarity for faster recognition. Specifically, it distributes the task of measuring similarity between images among multiple cores in a cluster. Experimental results on a face recognition dataset show nearly linear speedup when balancing the dataset size and number of nodes. Reconstruction errors are used as a similarity measure, with dictionaries learned using K-SVD for each image. The proposed parallel method distributes this similarity computation process to achieve faster image classification.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... - ijnlc
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a documents collection. However, how to extract the hidden topics of the documents collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from the scalability problem when the size of documents collection increases. In this paper, the Correlated Topic Model with variational ExpectationMaximization algorithm is implemented in MapReduce framework to solve the scalability problem. The proposed approach utilizes the dataset crawled from the public digital library. In addition, the full-texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. The experiments are conducted to demonstrate the performance of the proposed algorithm. From the evaluation, the proposed approach has a comparable performance in terms of topic coherences with LDA implemented in MapReduce framework.
An in-depth review on News Classification through NLP - IRJET Journal
This document provides an in-depth literature review of news classification through natural language processing (NLP). It discusses several existing approaches to news classification, including models that use convolutional neural networks (CNNs), graph-based approaches, and attention mechanisms. The document also notes that current search engines often return too many irrelevant results, so classification could help layer search results. It concludes that while many techniques have been developed, inconsistencies remain in effectively classifying news, so further research on combining NLP, feature extraction, and fuzzy logic is needed.
Advanced control scheme of doubly fed induction generator for wind turbine us... - IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Understanding Inductive Bias in Machine Learning - SUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
ACEP Magazine edition 4th launched on 05.06.2024 - Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
6th International Conference on Machine Learning & Applications (CMLA 2024) - ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of Machine Learning & Applications.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions - Victor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL - gerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3) represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities. Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because the interconnection of these networks makes them vulnerable to a variety of cyberattacks. To solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) algorithms. We employed a recent intrusion detection dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to train and test our model. The results of our experiments show that our CNN-LSTM method is much better at finding smart grid intrusions than other deep learning algorithms used for classification. In addition, our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection accuracy rate of 99.50%.
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS - IJNSA Journal
The smart irrigation system represents an innovative approach to optimize water usage in agricultural and landscaping practices. The integration of cutting-edge technologies, including sensors, actuators, and data analysis, empowers this system to provide accurate monitoring and control of irrigation processes by leveraging real-time environmental conditions. The main objective of a smart irrigation system is to optimize water efficiency, minimize expenses, and foster the adoption of sustainable water management methods. This paper conducts a systematic risk assessment by exploring the key components/assets and their functionalities in the smart irrigation system. The crucial role of sensors in gathering data on soil moisture, weather patterns, and plant well-being is emphasized in this system. These sensors enable intelligent decision-making in irrigation scheduling and water distribution, leading to enhanced water efficiency and sustainable water management practices. Actuators enable automated control of irrigation devices, ensuring precise and targeted water delivery to plants. Additionally, the paper addresses the potential threat and vulnerabilities associated with smart irrigation systems. It discusses limitations of the system, such as power constraints and computational capabilities, and calculates the potential security risks. The paper suggests possible risk treatment methods for effective secure system operation. In conclusion, the paper emphasizes the significant benefits of implementing smart irrigation systems, including improved water conservation, increased crop yield, and reduced environmental impact. Additionally, based on the security analysis conducted, the paper recommends the implementation of countermeasures and security approaches to address vulnerabilities and ensure the integrity and reliability of the system. 
By incorporating these measures, smart irrigation technology can revolutionize water management practices in agriculture, promoting sustainability, resource efficiency, and safeguarding against potential security threats.
Embedded machine learning-based road conditions and driving behavior monitoring - IJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Texts Classification with the usage of Neural Network based on the Word2vec’s Words Representation
International Journal on Soft Computing (IJSC) Vol.14, No.2, May 2023
DOI: 10.5121/ijsc.2023.14201
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S WORDS REPRESENTATION
D. V. Iatsenko
Department of Information and Measuring Technologies, Southern Federal University,
Milchakova 10, Rostov-on-Don 344090, Russia
ABSTRACT
Assigning a submitted text to one of several predetermined categories is required when dealing with application-oriented texts. There are many different approaches to solving this problem, including neural network algorithms. This article explores using neural networks to sort news articles by category. Two word vectorization algorithms are used: the Bag of Words (BOW) model and the word2vec distributive semantic model. In this work the BOW model was applied to a fully connected neural network (FNN), whereas the word2vec model was applied to a convolutional neural network (CNN). We measured the classification accuracy when applying these methods to ad text datasets. The experimental results show that both models achieve quite comparable accuracy. However, the word2vec encoding used with the CNN produced results that were more relevant with regard to text semantics. Moreover, the trained CNN based on the word2vec representation produced a compact feature map on its last convolutional layer, which can be used as a text representation in the future, i.e. using the CNN as a text encoder and for transfer learning.
KEYWORDS
Deep Learning, Text classification, Word2Vec, BOW, CNN
1. INTRODUCTION
Text classification is one of the most common text processing tasks. It is quite close to the task of determining a text's relevance to a query, which is now mostly solved by modern search engines like Yandex, Google, etc. At different times, this type of task has been solved by various methods. One of the first was to use the statistical measure of a word's importance, TF-IDF, in order to define a text's relevance to a given word. Many other methods and principles were used afterwards, mainly based on the statistical characteristics of words and their parts, semantic analysis of the given text, vector representations of words, and measurements of the proximity of those vectors, based on different neural networks, etc. [1].
Let’s take a look at solving this type of text classification issue, based on different text
representation models and different types of neural networks. Let’s take commonly known fetch
20newsgroups data set from the sklearn.datasets. pack for analysis.
2. DATASET DESCRIPTION
The fetch_20newsgroups dataset contains posts collected from Usenet newsgroup bulletin boards. There are a total of 11314 training texts and 7532 test texts in the dataset. The posts are divided into 20 classes:
0. alt.atheism,
1. comp.graphics,
2. comp.os.ms-windows.misc,
3. comp.sys.ibm.pc.hardware,
4. comp.sys.mac.hardware,
5. comp.windows.x,
6. misc.forsale,
7. rec.autos,
8. rec.motorcycles,
9. rec.sport.baseball,
10. rec.sport.hockey,
11. sci.crypt,
12. sci.electronics,
13. sci.med,
14. sci.space,
15. soc.religion.christian,
16. talk.politics.guns,
17. talk.politics.mideast,
18. talk.politics.misc,
19. talk.religion.misc
The texts contain 200 to 300 words on average. Here is an example text from the dataset, mapped to the rec.autos class:
”From: lerxst@wam.umd.edu (where’s my thing) Subject:
WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
I was wondering if anyone out there could enlighten me on this car I saw the other day.
It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The
doors were really small. In addition, the front bumper was separate from the rest of the body.
This is all I know. If anyone can tell me a model name, engine specs, years of production, where
this car is made, history, or whatever info you have on this funky looking car, please e-mail.
Thanks,
- IL
—- brought to you by your neighborhood Lerxst —-” [2]
The texts are distributed quite evenly across the classes (see Figure 1).
Figure 1: Text class distribution on (a) the training dataset and (b) the validation dataset.
3. TEXT REPRESENTATION MODELS
In order to process text with neural networks, the text must be presented as a vector of real numbers in a format matching the network input. The input data format can vary both in the dimensionality of the vectors and in the coding principle, i.e., the encoding can be one-hot (positional) encoding or a dense vector representation.
3.1. The ”Bag of Words” Model
The ”Bag of Words” (BoW) model is essentially a simple text vectorization based on a per-word characteristic (TF-IDF). To make this method work, the labeled text corpus undergoes the following transformations:
• The text corpus is tokenized.
• A corpus dictionary is built with a table of word occurrence frequencies, usually sorted in descending order. The relationship between a word's occurrence frequency and its index in the dictionary follows Zipf's law (Figure 2).
Figure 2: Distribution of the words relative frequencies.
• Words with too low or too high a frequency are discarded: in the former case the word will hardly ever occur again (and so contributes little to the overall analysis), while in the latter case it occurs almost everywhere (and so does not help assign the text to any category at all). The formula for DF (1) is widely known and, in general, represents the ratio of the number of
texts containing the desired word to the total number of texts in the corpus. IDF is essentially the inverse of DF, and its value is often taken as a logarithm (2) for standardization [3]:

DF_i = K_i / K (1)

IDF_i = log(K / K_i) (2)

where K_i is the number of texts containing the i-th word and K is the total number of texts in the corpus.
• Thus, using this method, each text is represented as a vector of real numbers with a size equal to the size of the corpus dictionary, where the i-th component contains the number of occurrences of the i-th word in the text weighted by its frequency characteristic DF_i (3):

val_i = n_i · DF_i (3)

• A set of pairs of such vectors and text class numbers forms the training data for the neural network that will carry out the classification.
It should be noted that the final dataset is represented by a large sparse K×W matrix, where K is the total number of texts and W is the dictionary size.
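The dictionary-building and weighting steps above can be sketched in a few lines of Python. This is a minimal illustration of formulas (1)-(3) on a toy corpus; the function names and the corpus itself are our own inventions, not code from the paper's notebooks:

```python
import math
from collections import Counter

def build_df(corpus_tokens):
    """DF_i = K_i / K: fraction of texts containing the i-th word (eq. 1)."""
    K = len(corpus_tokens)
    containing = Counter()
    for tokens in corpus_tokens:
        for word in set(tokens):          # count each text at most once per word
            containing[word] += 1
    return {w: k_i / K for w, k_i in containing.items()}

def vectorize(tokens, df, vocab):
    """val_i = n_i * DF_i: occurrence count weighted by DF (eq. 3)."""
    counts = Counter(tokens)
    return [counts[w] * df.get(w, 0.0) for w in vocab]

corpus = [["buy", "a", "car"], ["sell", "a", "car"], ["space", "news"]]
df = build_df(corpus)
idf = {w: math.log(1.0 / v) for w, v in df.items()}   # IDF_i = log(K / K_i), eq. 2
vocab = sorted(df)
vec = vectorize(["car", "car", "news"], df, vocab)    # one row of the K x W matrix
```

Real implementations (e.g. sklearn's TfidfVectorizer) additionally normalize and smooth these quantities; the sketch keeps only the paper's definitions.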
3.2. Word2Vec Model
Let us look at an alternative text vectorization, which represents a word as a vector in the semantic space of the word2vec distributive semantic language model [4]. In this model a word is not encoded by the one-hot principle but as a numeric vector in the N-dimensional semantic space of the model. Such models are distributed as embeddings published in public projects by many research groups. The following embeddings were used in the presented study:
• word2vec-google-news-300
• glove-wiki-gigaword-50
• glove-wiki-gigaword-100
• glove-wiki-gigaword-200
• glove-wiki-gigaword-300
• glove-twitter-25
• glove-twitter-50
• glove-twitter-100
• glove-twitter-200
Thus, a word is represented by a vector of 25 to 300 values, depending on the selected model; in our case, a text is a set of such vectors.
Using raw values instead of one-hot encoding is highly inefficient in the classic approach, but with word2vec it is quite plausible, mostly because close values in the vector representation correspond to semantically close entities in the model's semantic space, i.e., small changes in the input values do not significantly change the behavior of the model [5].
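Semantic proximity of word vectors is usually measured by cosine similarity. A minimal sketch with toy 3-dimensional vectors (the words and vectors below are invented for illustration; real embeddings such as glove-wiki-gigaword-300 have 300 dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embedding: semantically close words are assigned close vectors.
emb = {
    "car":   [0.9, 0.1, 0.0],
    "auto":  [0.8, 0.2, 0.1],
    "space": [0.0, 0.1, 0.9],
}
# "car" is closer to "auto" than to "space" in this toy space.
close = cosine(emb["car"], emb["auto"]) > cosine(emb["car"], emb["space"])
```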
Dataset preparation is similar to the stages for the ”Bag of Words” model. The corpus of texts is tokenized, each word is replaced by its vector from a word2vec model, and as a result each text is represented by a two-dimensional N×M matrix, where N is the number of words and M is the dimension of the selected embedding. The whole dataset
is then presented as a three-dimensional K×N×M matrix, where K is the number of texts in the corpus [6].
This representation creates a small issue: the texts have different lengths, which complicates the definition of the neural network's input dimensions. There are various methods to solve this kind of problem:
• N can be fixed to a single predetermined value by appending zero vectors to the end of shorter texts and truncating longer ones.
• Alternatively, a global pooling operation can be used.
• A pre-trained RNN can also be used to form a fixed-size feature map as it passes over the text.
In our case we use the first method, since the initial dataset is assumed to be a collection of small, similarly sized texts.
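The first method (fixing N via zero-padding and truncation) can be sketched as follows; the function name and the small N and M values are our own, chosen for illustration:

```python
def pad_or_truncate(text_vectors, n, m):
    """Bring a text (a list of M-dimensional word vectors) to exactly n rows:
    cut off the tail if the text is longer, append zero vectors if shorter."""
    zero = [0.0] * m
    fixed = text_vectors[:n]             # truncate long texts
    fixed += [zero] * (n - len(fixed))   # pad short texts with zero vectors
    return fixed

short = [[1.0, 2.0]]                               # 1 word, M = 2
long = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]        # 3 words, M = 2
padded = pad_or_truncate(short, 2, 2)              # -> 2 rows, zero-padded
truncated = pad_or_truncate(long, 2, 2)            # -> first 2 rows kept
```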
4. IMPLEMENTATION OF THE NEURAL NETWORKS FOR WORKING WITH
VARIOUS MODELS OF TEXT REPRESENTATION
4.1. Fully Connected Neural Network and The ”Bag Of Words”
A fully connected neural network (FNN) may consist of just an output layer with one neuron per class. In this case, for two-class training the loss function should be binary cross-entropy, BCELoss (4), or cross-entropy over a Softmax output (5) for multi-class training:

L_BCE = −(y · log(ŷ) + (1 − y) · log(1 − ŷ)) (4)

Softmax(z)_i = exp(z_i) / Σ_j exp(z_j) (5)

The network is trained by gradually minimizing the value of the loss function. During this process the model's parameters, i.e., the network weights, are adjusted. The following update rule is used to minimize the loss function (6):

w_i = w_{i−1} − α · ∇L (6)
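Rule (6) can be illustrated on a one-dimensional quadratic loss L(w) = (w − 2)², whose gradient is 2(w − 2); the target value 2.0 and the learning rate are arbitrary choices for this sketch:

```python
def grad_step(w, alpha, grad):
    """One update of rule (6): w_i = w_{i-1} - alpha * grad(L)."""
    return w - alpha * grad(w)

loss_grad = lambda w: 2 * (w - 2.0)   # dL/dw for L(w) = (w - 2)^2
w = 0.0
for _ in range(100):
    w = grad_step(w, 0.1, loss_grad)
# w converges toward the minimum of the loss at w = 2.0
```

In the actual networks this step is performed by the Adam optimizer over hundreds of thousands of weights rather than a single scalar.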
After training, the network can classify texts. To do this, a given text is tokenized and turned into a vector according to the algorithm above (using the same dictionary), after which the resulting vector is presented to the network [7]. The output for the predicted class will have the largest value in the range (0, 1), which corresponds to the probability that the text actually belongs to the predicted class (Figure 3).
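The mapping from raw output-layer scores to class probabilities can be illustrated with a softmax, as in formula (5); the three scores below are invented for the example:

```python
import math

def softmax(z):
    """Softmax (5): exponentiate and normalize so the outputs sum to 1."""
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

scores = [2.0, 1.0, 0.1]                # raw network outputs for 3 classes
probs = softmax(scores)                 # probabilities in (0, 1), summing to 1
predicted = probs.index(max(probs))     # the class with the largest probability
```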
The presented network has the following architecture: one fully connected layer whose number of inputs equals the number of unique words (UNIQUE_WORDS_N, 21628 in this example) and whose number of neurons equals the number of text classes (UNIQUE_LABELS_N, 20 in this example). Shown below is the network structure in the tabular model.summary() representation (Table 1).
The source code of the notebook with the network implementation is available at https://github.com/dyacenko/article_text_processing/blob/main/network1_1_tfidf.ipynb.
Figure 3: Processing of vectorized text by a Fully-connected Neural Network.
Table 1: Fully-connected NN Architecture.
Layer (type) Output Shape Param#
Linear-1 [-1, 20] 432,580
Total params: 432,580
Trainable params: 432,580
Non-trainable params:0
Input size (MB): 0.32
Forward/backward pass size (MB): 0.99
Params size (MB): 1.65
Estimated Total Size (MB): 2.64
The network is trained by presenting it with text samples encoded according to the TF-IDF principle. cross_entropy from the torch.nn.functional package is used for loss evaluation. The resulting loss value is then minimized by gradient descent with the Adam optimizer from the torch.optim package. In this example each text is represented by a large sparse vector, so special data structures were used to store the sparse data more efficiently. Shown below are the loss values at each epoch during network training and testing (Figure 4). As the graph shows, the network learns quite steadily, reaching a plateau by the 15th epoch; no overfitting occurs. As a result, the network achieves a classification accuracy of 0.766 on the test set. Shown below is the confusion matrix of the trained network (Figure 5) on texts from the test set.
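Storing only the nonzero components is the idea behind the sparse structures mentioned above. A minimal dict-based sketch (real code would typically use scipy.sparse matrices or torch sparse tensors; the helper names here are our own):

```python
def to_sparse(dense):
    """Keep only the nonzero entries of a dense vector as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}

def sparse_dot(sparse_vec, weights):
    """Dot product of a sparse vector with one dense weight column:
    only the stored (nonzero) indices contribute."""
    return sum(v * weights[i] for i, v in sparse_vec.items())

dense = [0.0] * 8        # a TF-IDF vector over a 21628-word dictionary
dense[3] = 1.5           # would look like this: almost entirely zeros
dense[6] = 0.5
sparse = to_sparse(dense)   # only two entries are actually stored
```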
Figure 4: Training metrics of the fully connected network.
Figure 5: Confusion matrix of the FNN.
4.2. Convolutional Neural Network and Word2vec Encoding
Now let us use a deep convolutional neural network to classify the texts vectorized with the word2vec distributive semantic model. The convolution operation has long been used in signal processing, as well as in image processing and filtering. In general, a convolution (7) is an operation on a pair of matrices A of size (n_x, n_y) and B of size (m_x, m_y) which results in a matrix C = A ∗ B of size (n_x − m_x + 1, n_y − m_y + 1). Each element of the result is calculated as the scalar product of the matrix B with a submatrix of A of the same size as B, where the submatrix is determined by the position of the element in the result, i.e.:

C[i, j] = Σ_k Σ_l A[i + k, j + l] · B[k, l] (7)
The convolutional neural network described by Yann LeCun in [8] uses quite a similar mathematical apparatus, but applied to neural networks. This means that the matrix A is the
neural network input, the matrix B is the convolution kernel, which essentially acts as a special case of a partially connected neural network, and the matrix C is the network output. In this case the elements of B are the weights of the neural connections, which are adjusted during training along with the bias value embedded in the convolution kernel.
A CNN can be used in a wide variety of cases. Here a one-dimensional convolution (conv1d) was used: the convolution kernel moves in only one direction, through the text, performing the convolution operation on three adjacent words (Figure 6). The number of output channels is taken equal to the number of input channels. A subsampling operation and an activation function follow the convolution layer. Deep convolutional neural networks are often built so that the output of one convolutional layer acts as the input of the next, as in modern architectures [9].
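A minimal sketch of that one-dimensional convolution: the kernel covers three adjacent word vectors and slides along the text. A single output channel is shown here for clarity (the paper's network uses 300), and the toy vectors are our own:

```python
def conv1d(text, kernel):
    """Slide a kernel over a text (a list of word vectors), taking the scalar
    product with every window of len(kernel) adjacent words."""
    k = len(kernel)
    out = []
    for i in range(len(text) - k + 1):
        window = text[i:i + k]
        out.append(sum(w * x
                       for kernel_row, word_vec in zip(kernel, window)
                       for w, x in zip(kernel_row, word_vec)))
    return out

# 5 words, embedding dimension M = 2; the kernel covers 3 adjacent words.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
kernel = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # sums the first components
out = conv1d(text, kernel)                      # N - k + 1 = 3 output values
```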
Figure 6: The processing of vectorized text by the means of CNN
A series of experiments showed that the optimal network depth for the current task is 9 convolutional layers, with the feature map formed as a single vector whose size equals the size of a word vector from the embedding output; in this example, 300 values. From this feature map, a fully connected layer classifies the texts using a number of neurons equal to the number of classes.
Thus the network architecture consists of 9 convolutional layers with 300 input and output channels and a convolution kernel of size 3, plus one fully connected layer with 300 inputs and 20 outputs (Table 2).
It is important to note that with a network this deep, gradient descent suffers from gradient attenuation [10]: the gradient practically does not reach the first layers of the network, so they are not trained. To minimize this effect, the ReLU activation function [11] and the residual block mechanism, first used by the developers of the deep ResNet network [12], were used in this work.
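The residual mechanism can be sketched in one line per block: the block's input is added to its output, so the gradient always has a direct path through the sum. A toy version, with elementwise scaling standing in for the real 300-channel convolution (all names are our own):

```python
def relu(v):
    """ReLU activation: max(0, x) elementwise."""
    return [max(0.0, x) for x in v]

def layer(v, weight):
    """A toy stand-in for a convolutional layer: elementwise scaling."""
    return [weight * x for x in v]

def residual_block(v, weight):
    """out = ReLU(layer(v)) + v: the skip connection lets gradients bypass
    the layer, mitigating attenuation in deep networks."""
    return [a + b for a, b in zip(relu(layer(v, weight)), v)]

x = [1.0, -2.0, 3.0]
out = residual_block(x, 0.0)   # even with zero weights the input passes through
```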
The source code of the notebook with the network implementation is available at https://github.com/dyacenko/article_text_processing/blob/main/network2_vec.ipynb.
The network is trained by presenting it with text samples encoded with word vectors from the word2vec distributive model. cross_entropy from the torch.nn.functional package is used for loss evaluation. The resulting loss value is then minimized by gradient descent with the Adam optimizer from the torch.optim package. Shown below are the loss values at each epoch during network training and testing (Figure 7).
Figure 7: Training metrics of the convolutional network.
As the graph shows, the network learns quite steadily, reaching a plateau by the 15th epoch; no overfitting occurs. As a result, the network achieves a classification accuracy of 0.763 on the test set. Shown below is the confusion matrix of the trained network (Figure 8) on texts from the test set.
4.3. Adaboost and TF-IDF Encoding
For comparison, let’s use the already made adaboost classifying model from the sklearn library.
We will use a Decision Tree as a basic algorithm We will also use the same fetch 20newsgroups
dataset with TF-IDF encoding. The source code of a notebook with a network implementation is
shown here: https://github.com/d-yacenko/artic
le_text_processing/blob/main/network1_1_ada.ipynbAs a result, the model
demonstrates an accuracy of 0.6523 on the test dataset. Shown below is the confusion matrix of
the trained network (Figure9) when the texts taken from the test set are processed.
Figure 8: Confusion matrix of the CNN.
Figure 9: Confusion matrix of the AdaBoost model.
5. ANALYSIS RESULT
All of the aforementioned experiments were conducted on a Virtual PC:
• CPU Intel(R) Xeon(R) CPU @ 2.00GHz
• RAM 12Gb
• GPU Tesla P100 with 16GB of RAM
• OS Ubuntu 18.04.5 LTS (Bionic Beaver)
The final characteristics of neural networks of different architectures considered in this article are
as follows (Table 3):
Table 3: Pivot characteristics of different networks.

Network         | Number of weights | Training time (s) | Prediction time (ms) | Max accuracy
Fully Connected | 432,580           | 96                | 0.4                  | 0.766
Convolutional   | 2,438,720         | 679               | 5.4                  | 0.763
AdaBoost        | –                 | 1656              | 0.16                 | 0.663
6. DISCUSSIONS AND CONCLUSIONS
The comparison of different text encodings for processing in neural networks of different architectures has been discussed by many authors [13], [14]. This study also presents practical results and compares them across architectures. As can be seen from the presented data, the different approaches and methods show similar levels of accuracy. The convolutional neural network has a much more complex architecture and more configurable parameters, while the training time of the two networks is comparable. There are noticeable differences in the distribution of errors by category, which can be attributed to the use of semantic proximity by the convolutional neural network with word2vec encoding. This is quite noticeable, for example, in the fact that the CNN is much more likely to mistakenly classify texts from the soc.religion.christian category (class 15) as talk.religion.misc (class 19): 22 vs 14. At the same time, the network based on distributive semantic coding is much less likely to mistakenly classify talk.politics.guns (class 16) as talk.religion.misc (class 19) compared to TF-IDF encoding: 5 vs 18. Moreover, a network with word2vec encoding can classify texts containing words not represented in the training samples, as long as those words are represented in the embedding used to encode the texts.
In addition, a trained CNN can be used to convert texts into a feature map, which can then be used for transfer learning [15] or as a text encoder. All in all, despite the similar results shown by both methods, the CNN with word2vec encoding offers additional uses.
REFERENCES
[1] K. Mehta and S. P. Panda, “Sentiment analysis on e-commerce apparels using convolutional neural
network,” International Journal of Computing, vol. 21, pp. 234–241, Jun. 2022.
[2] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P.
Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M.
Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning
Research, vol. 12, pp. 2825–2830, 2011.
[3] S. Robertson, “Understanding inverse document frequency: on theoretical arguments for IDF,”
Journal of Documentation, vol. 60, pp. 503–520, oct 2004.
[4] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in
vector space,” CoRR, vol. abs/1301.3781, 2013.
[5] Z. Rui and H. Yutai, “Research on short text classification based on word2vec microblog,” in 2020
International Conference on Computer Science and Management Technology (ICCSMT), pp. 178–
182, 2020.
[6] G. Mustafa, M. Usman, L. Yu, M. Afzal, M. Sulaiman, and A. Shahid, “Multi-label classification of
research articles using word2vec and identification of similarity threshold,” Scientific Reports, vol.
11, p. 21900, 11 2021.
[7] B. Das and S. Chakraborty, “An improved text sentiment classification model using TF-IDF and next
word negation,” CoRR, vol. abs/1806.06407, 2018.
[8] Y. Lecun and Y. Bengio, Convolutional Networks for Images, Speech and Time Series, pp. 255–258.
The MIT Press, 1995.
[9] S. Baker and A. Korhonen, “Initializing neural networks for hierarchical multi-label text
classification,” in BioNLP 2017, (Vancouver, Canada,), pp. 307–315, Association for Computational
Linguistics, Aug. 2017.
[10] S. Hochreiter, “Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für
Informatik, Lehrstuhl Prof. Brauer, Technische Universität München,” 1991.
[11] A. Petrosyan, A. Dereventsov, and C. G. Webster, “Neural network integral representations with the
ReLU activation function,” in Proceedings of The First Mathematical and Scientific Machine
Learning Conference (J. Lu and R. Ward, eds.), vol. 107 of Proceedings of Machine Learning
Research, pp. 128–143, PMLR, 20– 24 Jul 2020.
[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol.
abs/1512.03385, 2015.
[13] B. Jang, I. Kim, and J. W. Kim, “Word2vec convolutional neural networks for classification of news
articles and tweets,” PLOS ONE, vol. 14, pp. 1–20, 08 2019.
[14] R. Kurnia and A. Girsang, “Classification of user comment using word2vec and deep learning,”
International Journal of Emerging Technology and Advanced Engineering, vol. 11, pp. 1–8, 05 2021.
[15] S. Bozinovski, “Reminder of the first paper on transfer learning in neural networks, 1976,”
Informatica (Slovenia), vol. 44, no. 3, 2020.