The length of a news article may influence people's interest in reading it. Text summarization can help by creating a shorter, representative version of an article to reduce reading time. This paper proposes using weighted word embeddings based on Word2Vec, FastText, and bidirectional encoder representations from transformers (BERT) models to enhance the TextRank summarization algorithm. The weighted word embeddings aim to create better sentence representations in order to produce more accurate summaries. The results show that using (unweighted) word embeddings significantly improves the performance of the TextRank algorithm, with the best performance obtained by the summarization system using BERT word embeddings. When each word embedding is weighted using term frequency-inverse document frequency (TF-IDF), performance improves further and significantly for all systems, with the biggest improvements achieved by the systems using Word2Vec (a 6.80% to 12.92% increase) and FastText (a 7.04% to 12.78% increase). Overall, our systems using weighted word embeddings outperform the TextRank method by up to 17.33% in ROUGE-1 and 30.01% in ROUGE-2. This demonstrates the effectiveness of weighted word embeddings in the TextRank algorithm for text summarization.
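As a rough sketch of the idea (not the authors' implementation), the snippet below builds TF-IDF-weighted sentence vectors from a hypothetical `embed` dictionary standing in for trained Word2Vec/FastText/BERT vectors, then ranks sentences with a power-iteration PageRank over cosine similarities, as TextRank does:

```python
import numpy as np
from collections import Counter
from math import log

def tfidf_weights(sentences):
    """Per-word TF-IDF, treating each sentence as a 'document'."""
    docs = [s.lower().split() for s in sentences]
    df = Counter(w for d in docs for w in set(d))
    n = len(docs)
    # small epsilon keeps weights nonzero when a word appears in every sentence
    return [{w: c / len(d) * log(n / df[w]) + 1e-9 for w, c in Counter(d).items()}
            for d in docs]

def sentence_vectors(sentences, embed, dim):
    """TF-IDF-weighted average of word vectors, one vector per sentence."""
    vecs = []
    for d_w in tfidf_weights(sentences):
        v, total = np.zeros(dim), 0.0
        for w, tw in d_w.items():
            v += tw * embed.get(w, np.zeros(dim))
            total += tw
        vecs.append(v / total if total else v)
    return np.array(vecs)

def textrank(vecs, d=0.85, iters=50):
    """PageRank over a cosine-similarity sentence graph."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
    sim = (vecs / norms) @ (vecs / norms).T
    np.fill_diagonal(sim, 0.0)
    row = sim.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0            # guard against dangling sentences
    trans = sim / row
    r = np.full(len(vecs), 1.0 / len(vecs))
    for _ in range(iters):
        r = (1 - d) / len(vecs) + d * trans.T @ r
    return r
```

The highest-ranked sentences would then be extracted as the summary; swapping the toy `embed` for a real pre-trained model is the step the paper evaluates.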
IRJET- Short-Text Semantic Similarity using Glove Word Embedding (IRJET Journal)
The document describes a study that uses GloVe word embeddings to measure semantic similarity between short texts. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The study trains GloVe word embeddings on a large corpus, then uses the embeddings to encode short texts and calculate their semantic similarity, comparing the accuracy to methods that use Word2Vec embeddings. It aims to show that GloVe embeddings may provide better performance for short text semantic similarity tasks.
Effect of word embedding vector dimensionality on sentiment analysis through ... (IAESIJAI)
Word embedding has become the most popular method of lexical description in a given context in the natural language processing domain, especially through the word to vector (Word2Vec) and global vectors (GloVe) implementations. Since GloVe is a pre-trained model that provides access to word mapping vectors of many dimensionalities, a large number of applications rely on its prowess, especially in the field of sentiment analysis. However, in the literature, we found that in many cases GloVe is implemented with arbitrary dimensionalities (often 300d) regardless of the length of the text to be analyzed. In this work, we conducted a study that identifies the effect of the dimensionality of word embedding mapping vectors on short and long texts in a sentiment analysis context. The results suggest that as the dimensionality of the vectors increases, the performance metrics of the model also increase for long texts. In contrast, for short texts, we recorded a threshold beyond which dimensionality does not matter.
GENERATING SUMMARIES USING SENTENCE COMPRESSION AND STATISTICAL MEASURES (ijnlc)
The document summarizes a technique for generating summaries using sentence compression and statistical measures. It first implements a graph-based technique to achieve sentence compression and information fusion. It then uses hand-crafted syntactic rules to prune compressed sentences. Finally, it uses probabilistic measures and word co-occurrence to obtain the summaries. The system can generate summaries at any user-defined compression rate.
A comparative analysis of particle swarm optimization and k means algorithm f... (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly. As there is a huge collection of data on the web, there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Document clustering collects documents into groups such that the documents within each group are similar to each other and dissimilar to documents of other groups. The quality of a clustering result depends greatly on the text representation and the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO), and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms. The bag-of-terms representation is often unsatisfactory, as it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to each word; the bag-of-terms representation is thus enriched with synonyms from WordNet. K-means, PSO, and hybrid PSO+K-means algorithms are applied to cluster text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
Single document keywords extraction in Bahasa Indonesia using phrase chunking (TELKOMNIKA JOURNAL)
Keywords help readers understand the idea of a document quickly. Unfortunately, considerable time and effort are often needed to come up with a good set of keywords manually. This research focused on generating keywords from a document automatically using phrase chunking. Firstly, we collected part-of-speech patterns from a collection of documents. Secondly, we used those patterns to extract candidate keywords from the abstract and the content of a document. Finally, keywords are selected from the candidates based on the number of words in the keyword phrases and several scenarios involving candidate reduction and sorting. We evaluated the result of each scenario using precision, recall, and F-measure. The experiment results show: i) shorter-phrase keywords with string reduction, extracted from the abstract and sorted by frequency, provide the highest score, ii) in every proposed scenario, extracting keywords from the abstract always gives a better result, iii) using shorter-phrase patterns in keyword extraction gives a better score than using all phrase patterns, iv) sorting scenarios based on the multiplication of candidate frequencies and the weights of the phrase patterns offer better results.
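The pattern-matching step of such a chunker can be sketched as follows; the POS tag set and the patterns here are illustrative placeholders, not the paper's actual patterns:

```python
def extract_candidates(tagged, patterns=("JJ NN", "NN NN", "NN")):
    """Slide over (word, POS-tag) pairs and keep every span whose
    tag sequence exactly matches one of the given patterns."""
    pats = [p.split() for p in patterns]
    candidates = []
    for i in range(len(tagged)):
        for pat in pats:
            j = i + len(pat)
            if j <= len(tagged) and [t for _, t in tagged[i:j]] == pat:
                candidates.append(" ".join(w for w, _ in tagged[i:j]))
    return candidates
```

In a real system the `tagged` input would come from a POS tagger for Indonesian, and the candidates would then be reduced and sorted by the scenarios the abstract describes.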
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused a humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indexes for the given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (inverse document frequency) along with TF (term frequency) to extract keywords, which are later used for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
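A toy version of combining GSS coefficients with TF and IDF might look like the sketch below; the exact combination the authors use is not specified here, so the simple product is an assumption:

```python
from collections import Counter
from math import log

def gss(docs_in_cat, docs_not_in_cat, term):
    """GSS coefficient: P(t,c)P(~t,~c) - P(t,~c)P(~t,c) over a labelled corpus.
    Each document is a set of terms."""
    n = len(docs_in_cat) + len(docs_not_in_cat)
    a = sum(term in d for d in docs_in_cat)      # term present, in category
    b = sum(term in d for d in docs_not_in_cat)  # term present, other category
    c = len(docs_in_cat) - a                     # term absent, in category
    d = len(docs_not_in_cat) - b                 # term absent, other category
    return (a / n) * (d / n) - (b / n) * (c / n)

def keyword_scores(doc, docs_in_cat, docs_not_in_cat):
    """Hypothetical score = TF * IDF * GSS (clamped at 0)."""
    all_docs = docs_in_cat + docs_not_in_cat
    scores = {}
    for t, f in Counter(doc).items():
        df = sum(t in d for d in all_docs) or 1
        idf = log(len(all_docs) / df)
        scores[t] = f * idf * max(gss(docs_in_cat, docs_not_in_cat, t), 0.0)
    return scores
```

Terms that are both frequent in the document and strongly associated with its category score highest and become summary keywords.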
Automatic Text Summarization: A Critical Review (IRJET Journal)
This document provides a literature review and critical analysis of automatic text summarization techniques. It discusses extractive and abstractive summarization approaches and reviews 10 papers published between 2009-2021 on topics like graph-based, keyword-based, and feature-based summarization methods. The document aims to identify strengths and limitations of the approaches discussed and opportunities for future work in automatic text summarization.
A Survey of Various Methods for Text Summarization (IJERD Editor)
Document summarization means retrieving short, important text from the source document. In this paper, we study various techniques. Plenty of techniques have been developed for English summarization and other Indian languages, but much less effort has been made for the Hindi language. Here, we discuss various techniques in terms of features such as time and memory consumption, efficiency, accuracy, ambiguity, and redundancy.
Efficient multi-document summary generation using neural network (INFOGAIN PUBLICATION)
This paper proposes a multi-document summarization system that uses bisect k-means clustering, an optimal merge function, and a neural network. The system first preprocesses input documents through stemming and removing stop words. It then applies bisect k-means clustering to group similar sentences. The clusters are merged using an optimal merge function to find important keywords. The NEWSUM algorithm is used to generate a primary summary for each keyword. A neural network trained on sentence classifications is then used to classify sentences in the primary summary as positive or negative. Only positively classified sentences are included in the final summary to improve accuracy. The system aims to generate a concise and accurate summary in a short period of time from multiple documents on a given topic.
An in-depth review on News Classification through NLP (IRJET Journal)
This document provides an in-depth literature review of news classification through natural language processing (NLP). It discusses several existing approaches to news classification, including models that use convolutional neural networks (CNNs), graph-based approaches, and attention mechanisms. The document also notes that current search engines often return too many irrelevant results, so classification could help layer search results. It concludes that while many techniques have been developed, inconsistencies remain in effectively classifying news, so further research on combining NLP, feature extraction, and fuzzy logic is needed.
This document discusses an attempt to create an extractive automatic text summarizer. It splits document paragraphs into sentences and ranks the sentences based on summarization features, with higher ranked sentences considered more important for generating the summary. The proposed system uses the TextRank algorithm to rank sentences based on graph-based features. The paper presents the TextRank approach and compares the proposed system to existing MS Word summarization methods. Evaluation measures are also described to assess the performance of the summarizer.
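The graph-based ranking that TextRank performs can be sketched with its original word-overlap similarity (shared words normalised by the log of each sentence's length) and a plain PageRank iteration; this is an illustrative re-implementation, not the paper's code:

```python
from math import log

def overlap_sim(s1, s2):
    """Classic TextRank similarity between two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    denom = log(len(w1)) + log(len(w2))
    return len(w1 & w2) / denom if denom > 0 else 0.0

def rank_sentences(sentences, d=0.85, iters=30):
    """PageRank scores over the sentence-similarity graph."""
    n = len(sentences)
    sim = [[overlap_sim(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)] for i, a in enumerate(sentences)]
    totals = [sum(row) or 1.0 for row in sim]   # guard isolated sentences
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n + d * sum(sim[j][i] / totals[j] * scores[j]
                                        for j in range(n)) for i in range(n)]
    return scores
```

The top-scoring sentences are taken as the extractive summary; sentences that share vocabulary with many others accumulate higher scores than isolated ones.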
A hybrid approach for text summarization using semantic latent Dirichlet allo... (IJECEIAES)
Automatic text summarization generates a summary containing sentences that reflect the essential, relevant information of the original documents. Extractive summarization requires semantic understanding, while abstractive summarization requires a better intermediate text representation. This paper proposes a hybrid approach for generating text summaries that combines extractive and abstractive methods. To improve the semantic understanding of the model, we propose two novel extractive methods: semantic latent Dirichlet allocation (semantic LDA) and sentence concept mapping. We then generate an intermediate summary by applying our proposed sentence ranking algorithm over the sentence concept mapping. This intermediate summary is input to a transformer-based abstractive model fine-tuned with a multi-head attention mechanism. Our experimental results demonstrate that the proposed hybrid model generates coherent summaries using the intermediate extractive summary covering semantics. As we increase the number of concepts and words in the summary, the ROUGE precision and F1 scores of our proposed model improve.
Improving Text Categorization with Semantic Knowledge in Wikipediachjshan
Text categorization, especially short text categorization, is a difficult and challenging task since the text data is sparse and multidimensional. In traditional text classification methods, document texts are represented with “Bag of Words (BOW)” text representation schema, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used the Wikipedia-concept-based document representation method to take the place of traditional BOW model for text classification. In order to overcome the weakness of ignoring the semantic relationships among terms in document representation model and utilize rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short text shows that our approach outperforms the traditional BOW method.
Deep sequential pattern mining for readability enhancement of Indonesian summ... (IJECEIAES)
In text summarization research, readability is a major issue that must be addressed. Our hypothesis is that readability can be achieved by using text representations that keep the meaning of text documents intact. Therefore, this study combines sequential pattern mining (SPM), which produces word sequences as the text representation, with unsupervised deep learning to produce Indonesian text summaries; the resulting system is called DeepSPM. This research uses PrefixSpan as the SPM algorithm and a deep belief network (DBN) as the unsupervised deep learning method, applied to 18,774 Indonesian news texts from IndoSum. The readability aspect is evaluated with recall-oriented understudy for gisting evaluation (ROUGE) as a co-selection-based analysis; with Dwiyanto Djoko Pranowo metrics, the Gunning fog index (GFI), and the Flesch-Kincaid grade level (FKGL) as content-based analyses; and with a human readability evaluation by two experts. The experimental results show that DeepSPM performs better than DBN, with the F-measure of ROUGE-1 improving to 0.462, ROUGE-2 to 0.37, and ROUGE-L to 0.41. The significance of the ROUGE results was also tested using a t-test. The content-based analysis and human readability evaluation findings are consistent with the co-selection-based analysis: the generated summaries are only partially readable, i.e., they have a medium level of readability.
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
IRJET- A Survey Paper on Text Summarization Methods (IRJET Journal)
This document summarizes various approaches for automatic text summarization, including extractive and abstractive methods. It discusses early surface-level approaches from the 1950s that identified important sentences based on word frequency. It also reviews corpus-based, cohesion-based, rhetoric-based, and graph-based approaches. The document then examines single document summarization techniques like naive Bayes methods, log-linear models, and deep natural language analysis. It concludes with a discussion of multi-document summarization, including abstraction and information fusion as well as graph spreading activation approaches. The goal of the survey is to provide an overview of the major existing methods that have been used for automatic text summarization.
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION (IJNSA Journal)
Much previous research has shown that rhetorical relations can enhance many applications such as text summarization, question answering, and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in cluster-based text summarization of multiple documents. We exploit the rhetorical relations existing between sentences to group similar sentences into multiple clusters and identify themes of common information, from which candidate summary sentences are extracted. Then, cluster-based text summarization is performed using a Conditional Markov Random Walk Model to measure the saliency scores of the candidate summary sentences. We evaluated our method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations, as well as the ROUGE scores of the generated summaries. The experimental results show that our method performs well, which indicates the promising potential of applying rhetorical relations to text clustering for multi-document summarization.
Myanmar news summarization using different word representations (IJECEIAES)
There is an enormous amount of information available from different sources and genres. In order to extract useful information from a massive amount of data, an automatic mechanism is required. Text summarization systems assist with content reduction by keeping the important information and filtering out the unimportant parts of the text. Good document representation is very important in text summarization for obtaining relevant information. Bag-of-words cannot capture word similarity in syntactic and semantic relationships, whereas word embedding gives a good document representation that captures and encodes the semantic relations between words. Therefore, a centroid method based on word embedding representation is employed in this paper, and Myanmar news summarization based on different word embeddings is proposed. Myanmar local and international news are summarized using a centroid-based word embedding summarizer that leverages the effectiveness of the word embedding representation approach. Experiments were done on Myanmar local and international news datasets using different word embedding models, and the results are compared with the performance of bag-of-words summarization. Centroid summarization using word embedding performs comprehensively better than centroid summarization using bag-of-words.
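A minimal sketch of the centroid idea, assuming a hypothetical `embed` dict mapping words to vectors (the paper's actual models and preprocessing are more involved): each sentence is scored by the cosine similarity of its mean word vector to the document centroid, and the top-scoring sentences form the summary.

```python
import numpy as np

def centroid_summarize(sentences, embed, dim, k=1):
    """Pick the k sentences closest (by cosine) to the document centroid."""
    def sent_vec(s):
        vs = [embed[w] for w in s.lower().split() if w in embed]
        return np.mean(vs, axis=0) if vs else np.zeros(dim)

    def cos(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return float(a @ b / (na * nb)) if na and nb else 0.0

    svecs = np.array([sent_vec(s) for s in sentences])
    centroid = svecs.mean(axis=0)
    order = sorted(range(len(sentences)),
                   key=lambda i: cos(svecs[i], centroid), reverse=True)
    return [sentences[i] for i in order[:k]]
```

With a bag-of-words representation the same scheme works on sparse count vectors, which is the baseline the paper compares against.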
Conceptual framework for abstractive text summarization (ijnlc)
As the volume of information available on the Internet increases, there is a growing need for tools that help users find, filter, and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input and processes it by building a rich semantic graph, then reducing this graph to generate the final summary.
Automatic Text Summarization using Natural Language Processing (IRJET Journal)
The document discusses automatic text summarization using natural language processing. It describes using the Simplified Lesk calculation and WordNet to assess the importance of sentences in a document and perform word sense disambiguation. The proposed approach assesses sentence weights using Simplified Lesk and orders them by weight. It then selects sentences for the summary based on a given summarization percentage. The approach provides good results for summaries up to 50% of the original text and acceptable results up to 25%.
IRJET- Text Highlighting – A Machine Learning Approach (IRJET Journal)
This document discusses using machine learning models for extractive text summarization. It explores five models - artificial neural network, naive Bayes classifier, support vector machine, and two convolutional neural networks. It also proposes a new method to generate extractive summarization datasets from human summaries. The models are trained and tested on a dataset generated from CNN/Daily Mail articles, with the convolutional neural network approach achieving the highest accuracy.
Latent semantic analysis and cosine similarity for hadith search engine (TELKOMNIKA JOURNAL)
Search engine technology is used to find information easily, quickly, and efficiently, including information about the hadith, which is a second guideline of life for Muslims besides the Holy Qur'an. This study aimed to build a specialized search engine for finding complete hadith information in the Indonesian language. In this research, the search engine works using latent semantic analysis (LSA) and cosine similarity applied to the keywords entered. The LSA and cosine similarity methods are used to form structured representations of the text data and to calculate the similarity between the entered keyword text and the hadith text data, so that the hadith information returned accords with what was searched. Based on 50 test runs, LSA and cosine similarity had a high success rate in finding hadith information, with an average recall of 87.83%, although precision was low because much of the retrieved information was not semantically relevant, as indicated by an average precision of 36.25%.
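To make the pipeline concrete, here is a minimal LSA retrieval sketch using numpy's SVD on a raw term-document count matrix; the real system's corpus, term weighting, and latent rank are unknown, so everything below is illustrative:

```python
import numpy as np

def build_lsa(docs, rank=2):
    """Term-document count matrix -> truncated SVD.
    Returns the vocabulary index, the term basis U_k, and latent doc vectors."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[idx[w], j] += 1
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(rank, len(s))
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # rows: docs in latent space
    return idx, U[:, :k], doc_vecs

def query_lsa(query, idx, Uk, doc_vecs):
    """Project the query into latent space and score docs by cosine similarity."""
    q = np.zeros(len(idx))
    for w in query.lower().split():
        if w in idx:
            q[idx[w]] += 1
    qv = Uk.T @ q
    def cos(a, b):
        na, nb = np.linalg.norm(a), np.linalg.norm(b)
        return float(a @ b / (na * nb)) if na and nb else 0.0
    return [cos(qv, dv) for dv in doc_vecs]
```

Documents sharing latent structure with the query score high even without exact keyword overlap, which is the property that distinguishes LSA from plain term matching.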
Towards optimize-ESA for text semantic similarity: A case study of biomedical... (IJECEIAES)
Explicit semantic analysis (ESA) is an approach to measuring the semantic relatedness between terms or documents based on similarities to documents of a reference corpus, usually Wikipedia. ESA has received tremendous attention in natural language processing (NLP) and information retrieval. However, ESA relies on a huge Wikipedia index matrix in its interpretation step, multiplying a large matrix by a term vector to produce a high-dimensional vector. Consequently, the ESA interpretation and similarity steps are too expensive, and efficiency suffers because much time is lost in unnecessary operations. This paper proposes an enhancement to ESA, called optimize-ESA, that reduces the dimension at the interpretation stage by computing semantic similarity within a specific domain. The experimental results clearly show that our method correlates much better with human judgement than the full ESA approach.
Different Similarity Measures for Text Classification Using Knn (IOSR Journals)
This document summarizes research on classifying textual data using the k-nearest neighbors (KNN) algorithm and different similarity measures. It explores generating 9 different vector representations of text documents and using KNN with similarity measures such as Euclidean, Manhattan, and squared Euclidean distance to classify documents. The researchers tested KNN on a Reuters news corpus with 5,485 training documents across 8 classes and found that normalization and k=4 produced the best accuracy of 94.47%. They conclude that KNN with different similarity measures and vector representations is effective for multi-class text classification.
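The classification step with a pluggable distance measure can be sketched as below; the construction of the nine document-vector representations is omitted, and the metric names are just the common ones, not necessarily the study's exact list:

```python
import numpy as np
from collections import Counter

def knn_classify(train_vecs, train_labels, query, k=3, metric="euclidean"):
    """Majority vote among the k training vectors nearest to `query`.
    Cosine similarity is turned into a distance as 1 - similarity."""
    def dist(a, b):
        if metric == "euclidean":
            return float(np.linalg.norm(a - b))
        if metric == "manhattan":
            return float(np.abs(a - b).sum())
        if metric == "cosine":
            na, nb = np.linalg.norm(a), np.linalg.norm(b)
            return 1.0 - float(a @ b / (na * nb)) if na and nb else 1.0
        raise ValueError(f"unknown metric: {metric}")

    order = sorted(range(len(train_vecs)), key=lambda i: dist(train_vecs[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Because the distance function is the only moving part, comparing similarity measures (as the paper does) amounts to re-running the same classifier with a different `metric` argument.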
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING (kevig)
Applying natural language processing (NLP) algorithms to legal applications, for instance document classification of legal documents, contract review, and machine translation, is currently popular. All of these machine learning algorithms need to encode the words in a document in the form of vectors. The word embedding model is a modern distributed word representation approach and the most common unsupervised word encoding method; it supplies the input that other algorithms use to perform downstream NLP tasks. The most common and practical approach to evaluating the accuracy of a word embedding model uses a benchmark set built from linguistic rules or relationships between words to perform analogical reasoning via algebraic calculation. This paper proposes establishing a 1,256-question Legal Analogical Reasoning Questions Set (LARQS) from a corpus of 2,388 Chinese codices, using five kinds of legal relations, which is then used to evaluate the accuracy of Chinese word embedding models. Moreover, we discovered that legal relations might be ubiquitous in the word embedding model.
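Analogical reasoning via algebraic calculation follows the pattern "a is to b as c is to ?", answered by the vector b − a + c. A minimal sketch with hand-crafted toy vectors (LARQS applies the same test to legal term pairs; real embeddings are learned from text, not hand-set):

```python
import math

# Toy 3-dimensional vectors, hand-crafted (illustrative values) so that
# the offset between related pairs is similar; real models learn
# hundreds of dimensions from a corpus.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    # Solve "a is to b as c is to ?" via b - a + c, excluding the
    # three query words from the candidate answers (standard practice).
    target = [x - y + z for x, y, z in zip(emb[b], emb[a], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, emb[w]))

print(analogy("man", "king", "woman"))  # "queen"
```

A benchmark like LARQS scores an embedding by the fraction of such questions, built over its five legal relation types, that the model answers correctly.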
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGkevig
This document describes the development of a new legal word embedding evaluation dataset for Chinese called LARQS (Legal Analogical Reasoning Questions Set). It was created using a corpus of 2,388 Chinese legal documents and contains 1,256 questions evaluating 5 categories of legal relationships. The document discusses word embedding and existing evaluation benchmarks. It then describes how LARQS was created by legal experts and its potential usefulness compared to general-purpose benchmarks for evaluating legal-domain word embeddings.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of our proposed model. These findings underscore the model's competence in precise brain tumor localization, underscoring its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, emphasizing addressing false positives and resource efficiency.
Embedded machine learning-based road conditions and driving behavior monitoringIJECEIAES
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
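The reported accuracy, precision, and recall follow the standard confusion-matrix definitions, which can be computed as below (the labels here are illustrative, not the paper's driving dataset):

```python
def binary_metrics(y_true, y_pred):
    # Precision and recall for the positive class (1), plus overall accuracy.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# 1 = aggressive event, 0 = normal driving (hypothetical labels)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75)
```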
More Related Content
Similar to Enhanced TextRank using weighted word embedding for text summarization
8 efficient multi-document summary generation using neural networkINFOGAIN PUBLICATION
This paper proposes a multi-document summarization system that uses bisect k-means clustering, an optimal merge function, and a neural network. The system first preprocesses input documents through stemming and removing stop words. It then applies bisect k-means clustering to group similar sentences. The clusters are merged using an optimal merge function to find important keywords. The NEWSUM algorithm is used to generate a primary summary for each keyword. A neural network trained on sentence classifications is then used to classify sentences in the primary summary as positive or negative. Only positively classified sentences are included in the final summary to improve accuracy. The system aims to generate a concise and accurate summary in a short period of time from multiple documents on a given topic.
An in-depth review on News Classification through NLPIRJET Journal
This document provides an in-depth literature review of news classification through natural language processing (NLP). It discusses several existing approaches to news classification, including models that use convolutional neural networks (CNNs), graph-based approaches, and attention mechanisms. The document also notes that current search engines often return too many irrelevant results, so classification could help layer search results. It concludes that while many techniques have been developed, inconsistencies remain in effectively classifying news, so further research on combining NLP, feature extraction, and fuzzy logic is needed.
This document discusses an attempt to create an extractive automatic text summarizer. It splits document paragraphs into sentences and ranks the sentences based on summarization features, with higher ranked sentences considered more important for generating the summary. The proposed system uses the TextRank algorithm to rank sentences based on graph-based features. The paper presents the TextRank approach and compares the proposed system to existing MS Word summarization methods. Evaluation measures are also described to assess the performance of the summarizer.
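TextRank builds a graph whose nodes are sentences and whose edge weights are inter-sentence similarities, then runs a PageRank-style iteration over it. A compact sketch, using the word-overlap similarity from the original TextRank formulation (real systems add stop-word removal and a convergence check instead of a fixed iteration count):

```python
import math

def similarity(s1, s2):
    # Word-overlap similarity normalised by the log of sentence lengths,
    # as in the original TextRank formulation.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    denom = math.log(len(w1)) + math.log(len(w2))
    return overlap / denom if denom > 0 else 0.0

def textrank(sentences, d=0.85, iterations=50):
    n = len(sentences)
    sim = [[similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    scores = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(sim[j])  # total edge weight leaving node j
                if sim[j][i] > 0 and out > 0:
                    rank += sim[j][i] / out * scores[j]
            new.append((1 - d) + d * rank)
        scores = new
    return scores

sents = [
    "the cat sat on the mat",
    "the cat and the dog sat together",
    "the dog barked at the cat",
]
ranks = textrank(sents)
print(sents[ranks.index(max(ranks))])  # the middle sentence, linked to both others
```

The top-ranked sentences are then extracted as the summary; the weighted-embedding variant in the main paper replaces this word-overlap similarity with cosine similarity between (TF-IDF-weighted) sentence embeddings.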
A hybrid approach for text summarization using semantic latent Dirichlet allo...IJECEIAES
Automatic text summarization generates a summary that contains sentences reflecting the essential and relevant information of the original documents. Extractive summarization requires semantic understanding, while abstractive summarization requires a better intermediate text representation. This paper proposes a hybrid approach for generating text summaries that combines extractive and abstractive methods. To improve the semantic understanding of the model, we propose two novel extractive methods: semantic latent Dirichlet allocation (semantic LDA) and sentence concept mapping. We then generate an intermediate summary by applying our proposed sentence ranking algorithm over the sentence concept mapping. This intermediate summary is input to a transformer-based abstractive model fine-tuned with a multi-head attention mechanism. Our experimental results demonstrate that the proposed hybrid model generates coherent summaries using the intermediate extractive summary covering semantics. As we increase the number of concepts and words in the summary, the precision and F1 ROUGE scores of our proposed model improve.
Improving Text Categorization with Semantic Knowledge in Wikipediachjshan
Text categorization, especially short text categorization, is a difficult and challenging task since the text data is sparse and multidimensional. In traditional text classification methods, document texts are represented with “Bag of Words (BOW)” text representation schema, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used the Wikipedia-concept-based document representation method to take the place of traditional BOW model for text classification. In order to overcome the weakness of ignoring the semantic relationships among terms in document representation model and utilize rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short text shows that our approach outperforms the traditional BOW method.
Deep sequential pattern mining for readability enhancement of Indonesian summ...IJECEIAES
In text summarization research, readability is a major issue that must be addressed. Our hypothesis is that readability can be accomplished by using text representations that keep the meaning of text documents intact. Therefore, this study aims to combine sequential pattern mining (SPM), which produces a sequence of words as the text representation, with unsupervised deep learning to produce Indonesian text summaries, in a method called DeepSPM. This research uses PrefixSpan as the SPM algorithm and a deep belief network (DBN) as the unsupervised deep learning method, applied to 18,774 Indonesian news texts from IndoSum. The readability aspect is evaluated with recall-oriented understudy for gisting evaluation (ROUGE) as a co-selection-based analysis; Dwiyanto Djoko Pranowo metrics, the Gunning fog index (GFI), and the Flesch-Kincaid grade level (FKGL) as content-based analysis; and human readability evaluation with two experts. The experimental results show that DeepSPM outperforms DBN, with F-measure values of 0.462 for ROUGE-1, 0.37 for ROUGE-2, and 0.41 for ROUGE-L. The significance of the ROUGE results was also tested using a t-test. The content-based analysis and human readability evaluation findings conform with those of the co-selection-based analysis: the generated summaries are only partially readable, i.e., they have a medium level of readability.
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
IRJET- A Survey Paper on Text Summarization MethodsIRJET Journal
This document summarizes various approaches for automatic text summarization, including extractive and abstractive methods. It discusses early surface-level approaches from the 1950s that identified important sentences based on word frequency. It also reviews corpus-based, cohesion-based, rhetoric-based, and graph-based approaches. The document then examines single document summarization techniques like naive Bayes methods, log-linear models, and deep natural language analysis. It concludes with a discussion of multi-document summarization, including abstraction and information fusion as well as graph spreading activation approaches. The goal of the survey is to provide an overview of the major existing methods that have been used for automatic text summarization.
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATIONIJNSA Journal
Much previous research has proven that the use of rhetorical relations can enhance many applications, such as text summarization, question answering, and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in cluster-based text summarization of multiple documents. We exploited the rhetorical relations that exist between sentences to group similar sentences into multiple clusters and identify themes of common information, from which the candidate summary sentences were extracted. Then, cluster-based text summarization is performed using a Conditional Markov Random Walk Model to measure the saliency scores of the candidate summary sentences. We evaluated our method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations, as well as the ROUGE scores of the generated summaries. The experimental results show that our method performs well, demonstrating the promising potential of applying rhetorical relations to text clustering for summarization of multiple documents.
Myanmar news summarization using different word representations IJECEIAES
There is an enormous amount of information available from different sources and genres. In order to extract useful information from a massive amount of data, an automatic mechanism is required. Text summarization systems assist with content reduction, keeping the important information and filtering out the non-important parts of the text. Good document representation is very important in text summarization for getting relevant information. Bag-of-words cannot capture word similarity in terms of syntactic and semantic relationships, whereas word embedding can give a good document representation that captures and encodes the semantic relations between words. Therefore, centroid summarization based on word embedding representation is employed in this paper, and Myanmar news summarization based on different word embeddings is proposed. Myanmar local and international news are summarized using a centroid-based word embedding summarizer. Experiments were done on a Myanmar local and international news dataset using different word embedding models, and the results are compared with the performance of bag-of-words summarization. Centroid summarization using word embedding performs comprehensively better than centroid summarization using bag-of-words.
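Centroid-based embedding summarization averages word vectors into sentence vectors and a document centroid, then keeps the sentences nearest the centroid. A sketch with tiny hand-made 3-dimensional vectors standing in for trained embeddings (the values are illustrative; the paper trains embeddings on Myanmar news):

```python
import math

# Hand-made 3-d vectors standing in for trained word embeddings (illustrative).
emb = {
    "economy": [0.9, 0.1, 0.1], "market": [0.8, 0.2, 0.1],
    "stocks":  [0.85, 0.15, 0.1], "fell":  [0.6, 0.3, 0.2],
    "weather": [0.1, 0.9, 0.1], "sunny":  [0.1, 0.8, 0.2],
}

def cosine_vec(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_vector(sentence):
    """Mean of the embeddings of the sentence's known words."""
    words = [w for w in sentence.lower().split() if w in emb]
    if not words:
        return [0.0] * 3
    return [sum(emb[w][d] for w in words) / len(words) for d in range(3)]

def centroid_summary(sentences, top_n=1):
    # Document centroid = mean of sentence vectors; the summary keeps
    # the sentences whose vectors lie closest (by cosine) to it.
    vecs = {s: sentence_vector(s) for s in sentences}
    centroid = [sum(v[d] for v in vecs.values()) / len(vecs) for d in range(3)]
    ranked = sorted(sentences, key=lambda s: cosine_vec(vecs[s], centroid),
                    reverse=True)
    return ranked[:top_n]

sents = ["stocks fell", "market economy stocks", "sunny weather"]
print(centroid_summary(sents))
```

With a bag-of-words representation the two economy sentences share only one literal token; with embeddings they share a direction in vector space, which is the advantage the abstract describes.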
Conceptual framework for abstractive text summarizationijnlc
As the volume of information available on the Internet increases, there is a growing need for tools that help users find, filter, and manage these resources. While more and more textual information is available online, effective retrieval is difficult without proper indexing and summarization of the content. One possible solution to this problem is abstractive text summarization. The idea is to propose a system that accepts a single English document as input and processes it by building a rich semantic graph and then reducing this graph to generate the final summary.
Automatic Text Summarization using Natural Language ProcessingIRJET Journal
The document discusses automatic text summarization using natural language processing. It describes using the Simplified Lesk calculation and WordNet to assess the importance of sentences in a document and perform word sense disambiguation. The proposed approach assesses sentence weights using Simplified Lesk and orders them by weight. It then selects sentences for the summary based on a given summarization percentage. The approach provides good results for summaries up to 50% of the original text and acceptable results up to 25%.
IRJET- Sewage Treatment Potential of Coir Geotextiles in Conjunction with Act...IRJET Journal
The document discusses using machine learning models for extractive text summarization, which involves selecting important sentences from a document to provide an overview. Five machine learning models are explored: an artificial neural network, naive Bayes classifier, support vector machine, and two convolutional neural networks. The models are used to classify sentences as important or not important based on features like position, length, terms, and similarity to other sentences. The models' performance on this text highlighting task is compared.
IRJET- Text Highlighting – A Machine Learning ApproachIRJET Journal
This document discusses using machine learning models for extractive text summarization. It explores five models - artificial neural network, naive Bayes classifier, support vector machine, and two convolutional neural networks. It also proposes a new method to generate extractive summarization datasets from human summaries. The models are trained and tested on a dataset generated from CNN/Daily Mail articles, with the convolutional neural network approach achieving the highest accuracy.
Latent semantic analysis and cosine similarity for hadith search engineTELKOMNIKA JOURNAL
Search engine technology is used to find needed information easily, quickly, and efficiently, including information about the hadith, which is the second guideline of life for Muslims after the Holy Qur'an. This study aimed to build a specialized search engine to find complete information about eleven hadith in the Indonesian language. In this research, the search engine works using latent semantic analysis (LSA) and cosine similarity based on the keywords entered. The LSA and cosine similarity methods are used to form structured representations of the text data and to calculate the similarity between the entered keywords and the hadith text data, so that the hadith information returned accords with what was searched for. Based on the results of 50 test runs, the LSA and cosine similarity approach had a high success rate in finding hadith information, with an average recall of 87.83%, although not many of the hadith found were semantically precise, as indicated by an average precision of 36.25%.
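The LSA half of the pipeline can be sketched as a truncated SVD of the term-document matrix, with queries folded into the latent space and documents ranked by cosine similarity. The toy corpus below uses English stand-ins for hadith text (illustrative only), and NumPy's SVD in place of a production indexer:

```python
import numpy as np

# Tiny stand-in corpus (the paper indexes Indonesian hadith text).
docs = [
    "prayer reward night",
    "reward of night prayer",
    "charity wealth poor",
    "wealth charity giving",
]

vocab = sorted({w for d in docs for w in d.split()})
# Term-document matrix: rows are terms, columns are documents.
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# LSA: truncated SVD of the term-document matrix.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(S[:k]) @ Vt[:k]).T   # documents in the k-dim latent space

def query_vec(q):
    tf = np.array([q.split().count(w) for w in vocab], dtype=float)
    return tf @ U[:, :k]                 # fold the query into the same space

def search(q):
    qv = query_vec(q)
    sims = doc_vecs @ qv / (np.linalg.norm(doc_vecs, axis=1)
                            * np.linalg.norm(qv) + 1e-12)
    return docs[int(np.argmax(sims))]

print(search("prayer at night"))
```

Words outside the vocabulary (like "at" above) simply contribute nothing to the query vector, so retrieval degrades gracefully rather than failing on unseen terms.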
Towards optimize-ESA for text semantic similarity: A case study of biomedical...IJECEIAES
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Neural network optimizer of proportional-integral-differential controller par...IJECEIAES
Wide application of the proportional-integral-differential (PID) regulator in industry requires constant improvement of the methods used to tune its parameters. This paper deals with optimizing PID regulator parameters using neural network methods. A methodology for choosing the architecture (structure) of the neural network optimizer is proposed, which consists in determining the number of layers, the number of neurons in each layer, and the form and type of activation function. Training algorithms based on minimizing the mismatch between the regulated value and the target value are developed, and the method of backpropagation of gradients is used to select the optimal learning rate for the neurons of the neural network. The neural network optimizer, which is a superstructure over the linear PID controller, improves the regulation accuracy from 0.23 to 0.09, thereby reducing power consumption from 65% to 53%. The results of the conducted experiments allow us to conclude that the created neural superstructure may well become a prototype of an automatic voltage regulator (AVR)-type industrial controller for tuning the parameters of the PID controller.
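For reference, the linear PID law that such an optimizer tunes combines proportional, integral, and derivative terms of the error. A minimal discrete sketch closed around a toy first-order plant (the gains and plant model are illustrative, not the paper's system):

```python
def pid_step(error, state, kp, ki, kd, dt):
    # One update of a discrete PID controller; `state` carries the
    # integral term and the previous error between calls.
    integral, prev_error = state
    integral += error * dt
    derivative = (error - prev_error) / dt
    output = kp * error + ki * integral + kd * derivative
    return output, (integral, error)

# Drive a simple first-order plant toward a setpoint (illustrative gains).
setpoint, y = 1.0, 0.0
state = (0.0, 0.0)
for _ in range(200):
    u, state = pid_step(setpoint - y, state, kp=2.0, ki=1.0, kd=0.1, dt=0.01)
    y += (u - y) * 0.01   # toy plant: dy/dt = u - y, Euler-integrated
print(round(y, 2))
```

A neural optimizer of the kind the abstract describes would adjust `kp`, `ki`, and `kd` online to minimize the mismatch between `y` and the setpoint.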
An improved modulation technique suitable for a three level flying capacitor ...IJECEIAES
This research paper introduces an innovative modulation technique for controlling a 3-level flying capacitor multilevel inverter (FCMLI), aiming to streamline the modulation process in contrast to conventional methods. The proposed simplified modulation technique paves the way for more straightforward and efficient control of multilevel inverters, enabling their widespread adoption and integration into modern power electronic systems. Through the amalgamation of sinusoidal pulse width modulation (SPWM) with a high-frequency square wave pulse, this controlling technique attains energy equilibrium across the coupling capacitor. The modulation scheme incorporates a simplified switching pattern and a decreased count of voltage references, thereby simplifying the control algorithm.
A review on features and methods of potential fishing zoneIJECEIAES
This review focuses on the importance of identifying potential fishing zones in seawater for sustainable fishing practices. It explores features such as sea surface temperature (SST) and sea surface height (SSH), along with the classification methods used to classify the data. The study underscores the importance of examining potential fishing zones using advanced analytical techniques and thoroughly explores the methodologies employed by researchers, covering both past and current approaches. The examination centers on data characteristics and the application of classification algorithms to identify potential fishing zones, the prediction of which relies significantly on the effectiveness of those algorithms. Previous research has assessed the performance of models such as support vector machines (SVM), naïve Bayes, and artificial neural networks (ANN); in one previous result, an SVM was 97.6% accurate in classifying test data for fisheries classification, compared to 94.2% for naïve Bayes. By considering recent works in this area, several recommendations for future work are presented to further improve the performance of potential fishing zone models, which is important to the fisheries community.
Electrical signal interference minimization using appropriate core material f...IJECEIAES
As demand for smaller, quicker, and more powerful devices rises, Moore's law is strictly followed. The industry has worked hard to make small devices that boost productivity, with the goal of optimizing device density. Scientists are reducing connection delays to improve circuit performance. This led them to three-dimensional integrated circuit (3D IC) concepts, which stack active devices and create vertical connections to diminish latency and shorten interconnects. Electrical coupling is a big worry with 3D integrated circuits, and researchers have developed and tested through-silicon via (TSV) and substrate designs to decrease electrical wave coupling. This study illustrates a novel noise-coupling reduction method using several electrical coupling models. A 22% drop in electrical coupling from wave-carrying to victim TSVs introduces this new paradigm and improves system performance even at higher THz frequencies.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet has forced the United Nations and governments to promote green energies and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems has gained stronger momentum due to their numerous advantages over fossil fuel types, advantages that go beyond sustainability to reach financial support and stability. The work in this paper introduces a hybrid system between PV and EV to support industrial and commercial plants. The paper covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram, which sets the priorities and requirements of the system, is presented. The proposed approach allows setups to improve their power stability, especially during power outages. The presented information supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy milk farmer support the theoretical work and highlight its benefits to existing plants. The short return on investment supports the paper's novel approach to a sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Bibliometric analysis highlighting the role of women in addressing climate ch...IJECEIAES
Fossil fuel consumption increased quickly, contributing to climate change
that is evident in unusual flooding and draughts, and global warming. Over
the past ten years, women's involvement in society has grown dramatically,
and they succeeded in playing a noticeable role in reducing climate change.
A bibliometric analysis of data from the last ten years has been carried out to
examine the role of women in addressing the climate change. The analysis's
findings discussed the relevant to the sustainable development goals (SDGs),
particularly SDG 7 and SDG 13. The results considered contributions made
by women in the various sectors while taking geographic dispersion into
account. The bibliometric analysis delves into topics including women's
leadership in environmental groups, their involvement in policymaking, their
contributions to sustainable development projects, and the influence of
gender diversity on attempts to mitigate climate change. This study's results
highlight how women have influenced policies and actions related to climate
change, point out areas of research deficiency and recommendations on how
to increase role of the women in addressing the climate change and
achieving sustainability. To achieve more successful results, this initiative
aims to highlight the significance of gender equality and encourage
inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter...IJECEIAES
The active and reactive load changes have a significant impact on voltage
and frequency. In this paper, in order to stabilize the microgrid (MG) against
load variations in islanding mode, the active and reactive power of all
distributed generators (DGs), including energy storage (battery), diesel
generator, and micro-turbine, are controlled. The micro-turbine generator is
connected to MG through a three-phase to three-phase matrix converter, and
the droop control method is applied for controlling the voltage and
frequency of MG. In addition, a method is introduced for voltage and
frequency control of micro-turbines in the transition state from gridconnected mode to islanding mode. A novel switching strategy of the matrix
converter is used for converting the high-frequency output voltage of the
micro-turbine to the grid-side frequency of the utility system. Moreover,
using the switching strategy, the low-order harmonics in the output current
and voltage are not produced, and consequently, the size of the output filter
would be reduced. In fact, the suggested control strategy is load-independent
and has no frequency conversion restrictions. The proposed approach for
voltage and frequency regulation demonstrates exceptional performance and
favorable response across various load alteration scenarios. The suggested
strategy is examined in several scenarios in the MG test systems, and the
simulation results are addressed.
Enhancing battery system identification: nonlinear autoregressive modeling fo...IJECEIAES
Precisely characterizing Li-ion batteries is essential for optimizing their
performance, enhancing safety, and prolonging their lifespan across various
applications, such as electric vehicles and renewable energy systems. This
article introduces an innovative nonlinear methodology for system
identification of a Li-ion battery, employing a nonlinear autoregressive with
exogenous inputs (NARX) model. The proposed approach integrates the
benefits of nonlinear modeling with the adaptability of the NARX structure,
facilitating a more comprehensive representation of the intricate
electrochemical processes within the battery. Experimental data collected
from a Li-ion battery operating under diverse scenarios are employed to
validate the effectiveness of the proposed methodology. The identified
NARX model exhibits superior accuracy in predicting the battery's behavior
compared to traditional linear models. This study underscores the
importance of accounting for nonlinearities in battery modeling, providing
insights into the intricate relationships between state-of-charge, voltage, and
current under dynamic conditions.
Smart grid deployment: from a bibliometric analysis to a surveyIJECEIAES
Smart grids are one of the last decades' innovations in electrical energy.
They bring relevant advantages compared to the traditional grid and
significant interest from the research community. Assessing the field's
evolution is essential to propose guidelines for facing new and future smart
grid challenges. In addition, knowing the main technologies involved in the
deployment of smart grids (SGs) is important to highlight possible
shortcomings that can be mitigated by developing new tools. This paper
contributes to the research trends mentioned above by focusing on two
objectives. First, a bibliometric analysis is presented to give an overview of
the current research level about smart grid deployment. Second, a survey of
the main technological approaches used for smart grid implementation and
their contributions are highlighted. To that effect, we searched the Web of
Science (WoS), and the Scopus databases. We obtained 5,663 documents
from WoS and 7,215 from Scopus on smart grid implementation or
deployment. With the extraction limitation in the Scopus database, 5,872 of
the 7,215 documents were extracted using a multi-step process. These two
datasets have been analyzed using a bibliometric tool called bibliometrix.
The main outputs are presented with some recommendations for future
research.
Use of analytical hierarchy process for selecting and prioritizing islanding ...IJECEIAES
One of the problems that are associated to power systems is islanding
condition, which must be rapidly and properly detected to prevent any
negative consequences on the system's protection, stability, and security.
This paper offers a thorough overview of several islanding detection
strategies, which are divided into two categories: classic approaches,
including local and remote approaches, and modern techniques, including
techniques based on signal processing and computational intelligence.
Additionally, each approach is compared and assessed based on several
factors, including implementation costs, non-detected zones, declining
power quality, and response times using the analytical hierarchy process
(AHP). The multi-criteria decision-making analysis shows that the overall
weight of passive methods (24.7%), active methods (7.8%), hybrid methods
(5.6%), remote methods (14.5%), signal processing-based methods (26.6%),
and computational intelligent-based methods (20.8%) based on the
comparison of all criteria together. Thus, it can be seen from the total weight
that hybrid approaches are the least suitable to be chosen, while signal
processing-based methods are the most appropriate islanding detection
method to be selected and implemented in power system with respect to the
aforementioned factors. Using Expert Choice software, the proposed
hierarchy model is studied and examined.
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...IJECEIAES
The power generated by photovoltaic (PV) systems is influenced by
environmental factors. This variability hampers the control and utilization of
solar cells' peak output. In this study, a single-stage grid-connected PV
system is designed to enhance power quality. Our approach employs fuzzy
logic in the direct power control (DPC) of a three-phase voltage source
inverter (VSI), enabling seamless integration of the PV connected to the
grid. Additionally, a fuzzy logic-based maximum power point tracking
(MPPT) controller is adopted, which outperforms traditional methods like
incremental conductance (INC) in enhancing solar cell efficiency and
minimizing the response time. Moreover, the inverter's real-time active and
reactive power is directly managed to achieve a unity power factor (UPF).
The system's performance is assessed through MATLAB/Simulink
implementation, showing marked improvement over conventional methods,
particularly in steady-state and varying weather conditions. For solar
irradiances of 500 and 1,000 W/m2
, the results show that the proposed
method reduces the total harmonic distortion (THD) of the injected current
to the grid by approximately 46% and 38% compared to conventional
methods, respectively. Furthermore, we compare the simulation results with
IEEE standards to evaluate the system's grid compatibility.
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...IJECEIAES
Photovoltaic systems have emerged as a promising energy resource that
caters to the future needs of society, owing to their renewable, inexhaustible,
and cost-free nature. The power output of these systems relies on solar cell
radiation and temperature. In order to mitigate the dependence on
atmospheric conditions and enhance power tracking, a conventional
approach has been improved by integrating various methods. To optimize
the generation of electricity from solar systems, the maximum power point
tracking (MPPT) technique is employed. To overcome limitations such as
steady-state voltage oscillations and improve transient response, two
traditional MPPT methods, namely fuzzy logic controller (FLC) and perturb
and observe (P&O), have been modified. This research paper aims to
simulate and validate the step size of the proposed modified P&O and FLC
techniques within the MPPT algorithm using MATLAB/Simulink for
efficient power tracking in photovoltaic systems.
Adaptive synchronous sliding control for a robot manipulator based on neural ...IJECEIAES
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for
robot hands is always an attractive topic in the research community. This is a
challenging problem because robot manipulators are complex nonlinear systems
and are often subject to fluctuations in loads and external disturbances. This
article proposes an adaptive synchronous sliding control scheme to improve trajectory tracking performance for a robot manipulator. The proposed controller
ensures that the positions of the joints track the desired trajectory, synchronize
the errors, and significantly reduces chattering. First, the synchronous tracking
errors and synchronous sliding surfaces are presented. Second, the synchronous
tracking error dynamics are determined. Third, a robust adaptive control law is
designed,the unknown components of the model are estimated online by the neural network, and the parameters of the switching elements are selected by fuzzy
logic. The built algorithm ensures that the tracking and approximation errors
are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results.
Simulation and experimental results show that the proposed controller is effective with small synchronous tracking errors, and the chattering phenomenon is
significantly reduced.
Remote field-programmable gate array laboratory for signal acquisition and de...IJECEIAES
A remote laboratory utilizing field-programmable gate array (FPGA) technologies enhances students’ learning experience anywhere and anytime in embedded system design. Existing remote laboratories prioritize hardware access and visual feedback for observing board behavior after programming, neglecting comprehensive debugging tools to resolve errors that require internal signal acquisition. This paper proposes a novel remote embeddedsystem design approach targeting FPGA technologies that are fully interactive via a web-based platform. Our solution provides FPGA board access and debugging capabilities beyond the visual feedback provided by existing remote laboratories. We implemented a lab module that allows users to seamlessly incorporate into their FPGA design. The module minimizes hardware resource utilization while enabling the acquisition of a large number of data samples from the signal during the experiments by adaptively compressing the signal prior to data transmission. The results demonstrate an average compression ratio of 2.90 across three benchmark signals, indicating efficient signal acquisition and effective debugging and analysis. This method allows users to acquire more data samples than conventional methods. The proposed lab allows students to remotely test and debug their designs, bridging the gap between theory and practice in embedded system design.
Detecting and resolving feature envy through automated machine learning and m...IJECEIAES
Efficiently identifying and resolving code smells enhances software project quality. This paper presents a novel solution, utilizing automated machine learning (AutoML) techniques, to detect code smells and apply move method refactoring. By evaluating code metrics before and after refactoring, we assessed its impact on coupling, complexity, and cohesion. Key contributions of this research include a unique dataset for code smell classification and the development of models using AutoGluon for optimal performance. Furthermore, the study identifies the top 20 influential features in classifying feature envy, a well-known code smell, stemming from excessive reliance on external classes. We also explored how move method refactoring addresses feature envy, revealing reduced coupling and complexity, and improved cohesion, ultimately enhancing code quality. In summary, this research offers an empirical, data-driven approach, integrating AutoML and move method refactoring to optimize software project quality. Insights gained shed light on the benefits of refactoring on code quality and the significance of specific features in detecting feature envy. Future research can expand to explore additional refactoring techniques and a broader range of code metrics, advancing software engineering practices and standards.
Smart monitoring technique for solar cell systems using internet of things ba...IJECEIAES
Rapidly and remotely monitoring and receiving the solar cell systems status parameters, solar irradiance, temperature, and humidity, are critical issues in enhancement their efficiency. Hence, in the present article an improved smart prototype of internet of things (IoT) technique based on embedded system through NodeMCU ESP8266 (ESP-12E) was carried out experimentally. Three different regions at Egypt; Luxor, Cairo, and El-Beheira cities were chosen to study their solar irradiance profile, temperature, and humidity by the proposed IoT system. The monitoring data of solar irradiance, temperature, and humidity were live visualized directly by Ubidots through hypertext transfer protocol (HTTP) protocol. The measured solar power radiation in Luxor, Cairo, and El-Beheira ranged between 216-1000, 245-958, and 187-692 W/m 2 respectively during the solar day. The accuracy and rapidity of obtaining monitoring results using the proposed IoT system made it a strong candidate for application in monitoring solar cell systems. On the other hand, the obtained solar power radiation results of the three considered regions strongly candidate Luxor and Cairo as suitable places to build up a solar cells system station rather than El-Beheira.
An efficient security framework for intrusion detection and prevention in int...IJECEIAES
Over the past few years, the internet of things (IoT) has advanced to connect billions of smart devices to improve quality of life. However, anomalies or malicious intrusions pose several security loopholes, leading to performance degradation and threat to data security in IoT operations. Thereby, IoT security systems must keep an eye on and restrict unwanted events from occurring in the IoT network. Recently, various technical solutions based on machine learning (ML) models have been derived towards identifying and restricting unwanted events in IoT. However, most ML-based approaches are prone to miss-classification due to inappropriate feature selection. Additionally, most ML approaches applied to intrusion detection and prevention consider supervised learning, which requires a large amount of labeled data to be trained. Consequently, such complex datasets are impossible to source in a large network like IoT. To address this problem, this proposed study introduces an efficient learning mechanism to strengthen the IoT security aspects. The proposed algorithm incorporates supervised and unsupervised approaches to improve the learning models for intrusion detection and mitigation. Compared with the related works, the experimental outcome shows that the model performs well in a benchmark dataset. It accomplishes an improved detection accuracy of approximately 99.21%.
Supermarket Management System Project Report.pdfKamal Acharya
Supermarket management is a stand-alone J2EE using Eclipse Juno program.
This project contains all the necessary required information about maintaining
the supermarket billing system.
The core idea of this project to minimize the paper work and centralize the
data. Here all the communication is taken in secure manner. That is, in this
application the information will be stored in client itself. For further security the
data base is stored in the back-end oracle and so no intruders can access it.
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Transcat
Join us for this solutions-based webinar on the tools and techniques for commissioning and maintaining PV Systems. In this session, we'll review the process of building and maintaining a solar array, starting with installation and commissioning, then reviewing operations and maintenance of the system. This course will review insulation resistance testing, I-V curve testing, earth-bond continuity, ground resistance testing, performance tests, visual inspections, ground and arc fault testing procedures, and power quality analysis.
Fluke Solar Application Specialist Will White is presenting on this engaging topic:
Will has worked in the renewable energy industry since 2005, first as an installer for a small east coast solar integrator before adding sales, design, and project management to his skillset. In 2022, Will joined Fluke as a solar application specialist, where he supports their renewable energy testing equipment like IV-curve tracers, electrical meters, and thermal imaging cameras. Experienced in wind power, solar thermal, energy storage, and all scales of PV, Will has primarily focused on residential and small commercial systems. He is passionate about implementing high-quality, code-compliant installation techniques.
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...DharmaBanothu
The Network on Chip (NoC) has emerged as an effective
solution for intercommunication infrastructure within System on
Chip (SoC) designs, overcoming the limitations of traditional
methods that face significant bottlenecks. However, the complexity
of NoC design presents numerous challenges related to
performance metrics such as scalability, latency, power
consumption, and signal integrity. This project addresses the
issues within the router's memory unit and proposes an enhanced
memory structure. To achieve efficient data transfer, FIFO buffers
are implemented in distributed RAM and virtual channels for
FPGA-based NoC. The project introduces advanced FIFO-based
memory units within the NoC router, assessing their performance
in a Bi-directional NoC (Bi-NoC) configuration. The primary
objective is to reduce the router's workload while enhancing the
FIFO internal structure. To further improve data transfer speed,
a Bi-NoC with a self-configurable intercommunication channel is
suggested. Simulation and synthesis results demonstrate
guaranteed throughput, predictable latency, and equitable
network access, showing significant improvement over previous
designs
Blood finder application project report (1).pdfKamal Acharya
Blood Finder is an emergency time app where a user can search for the blood banks as
well as the registered blood donors around Mumbai. This application also provide an
opportunity for the user of this application to become a registered donor for this user have
to enroll for the donor request from the application itself. If the admin wish to make user
a registered donor, with some of the formalities with the organization it can be done.
Specialization of this application is that the user will not have to register on sign-in for
searching the blood banks and blood donors it can be just done by installing the
application to the mobile.
The purpose of making this application is to save the user’s time for searching blood of
needed blood group during the time of the emergency.
This is an android application developed in Java and XML with the connectivity of
SQLite database. This application will provide most of basic functionality required for an
emergency time application. All the details of Blood banks and Blood donors are stored
in the database i.e. SQLite.
This application allowed the user to get all the information regarding blood banks and
blood donors such as Name, Number, Address, Blood Group, rather than searching it on
the different websites and wasting the precious time. This application is effective and
user friendly.
Build the Next Generation of Apps with the Einstein 1 Platform.
Rejoignez Philippe Ozil pour une session de workshops qui vous guidera à travers les détails de la plateforme Einstein 1, l'importance des données pour la création d'applications d'intelligence artificielle et les différents outils et technologies que Salesforce propose pour vous apporter tous les bénéfices de l'IA.
We have designed & manufacture the Lubi Valves LBF series type of Butterfly Valves for General Utility Water applications as well as for HVAC applications.
Enhanced TextRank using weighted word embedding for text summarization
International Journal of Electrical and Computer Engineering (IJECE)
Vol. 13, No. 5, October 2023, pp. 5472~5482
ISSN: 2088-8708, DOI: 10.11591/ijece.v13i5.pp5472-5482
Journal homepage: http://ijece.iaescore.com
Enhanced TextRank using weighted word embedding for text
summarization
Evi Yulianti, Nicholas Pangestu, Meganingrum Arista Jiwanggi
Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia
Article Info ABSTRACT
Article history:
Received Feb 4, 2023
Revised Mar 15, 2023
Accepted Apr 7, 2023
The length of a news article may influence people’s interest in reading the
article. In this case, text summarization can help to create a shorter
representative version of an article to reduce people’s reading time. This paper
proposes to use weighted word embedding based on Word2Vec, FastText,
and bidirectional encoder representations from transformers (BERT) models
to enhance the TextRank summarization algorithm. The use of weighted
word embedding aims to create better sentence representations in order to
produce more accurate summaries. The results show that using (unweighted)
word embedding significantly improves the performance of the TextRank
algorithm, with the best performance gained by the summarization system
using BERT word embedding. When each word embedding is weighted
using term frequency-inverse document frequency (TF-IDF), the
performance of all systems using word embedding further improves
significantly, with the biggest improvement achieved by the systems using
Word2Vec (with a 6.80% to 12.92% increase) and FastText (with a 7.04% to
12.78% increase). Overall, our systems using weighted word embedding can
outperform the TextRank method by up to 17.33% in ROUGE-1 and 30.01%
in ROUGE-2. This demonstrates the effectiveness of weighted word
embedding in the TextRank algorithm for text summarization.
Keywords:
Bidirectional encoder representations from transformers
FastText
Term frequency-inverse document frequency
Text summarization
TextRank
Weighted word embedding
Word2Vec
This is an open access article under the CC BY-SA license.
Corresponding Author:
Nicholas Pangestu
Faculty of Computer Science, Universitas Indonesia
Depok, West Java, Indonesia
Email: pangestu.nicholas@gmail.com
1. INTRODUCTION
According to Datareportal statistics in 2022 [1], Indonesia is a country where 204 million people (73%
of the population) use the internet. Every day, Indonesians spend an average of 20 percent of their internet
time reading the news. Their interest in reading the news is also supported by the increasing number of
Indonesian online news portals, estimated to reach 43,300 media outlets based on the Press Council statistics
in 2017 presented in [2]. Even though the public’s interest in reading the news is quite large, reading long
stories can take a long time. This may discourage them from reading through the news, especially because
people in the digital era tend to expect to find information from various sources quickly [3]. As a result of this
tendency, we often see the term “too long; didn’t read” or “tl;dr”, which introduces an outline or summary of
writings spread on the internet. This term aims to let other people quickly understand the main point of a text.
The presence of the term “tl;dr” itself also shows that the length of a text affects one’s reading interest.
It is beneficial if a lengthy text can be condensed into a more compact form, i.e., a summary, so that it
saves readers’ time and effort in finding important information from the text [4]. According to
Radev et al. [5], a summary is a text produced from one or more texts, which conveys important information
from them, and whose length is normally less than half the length of the original text. A specific research field
that explores the task of generating summaries from textual documents is called text summarization.
According to the way the summary is generated, text summarization can be categorized into extractive and
abstractive summarization [6]. In the former, the method only extracts sentences from the original text, while
in the latter, additional processes are performed on the extracted sentences, which makes the resulting
summary contain sentences different from those in the text [7].
TextRank [8] is an extractive summarization method that uses a graph to capture the relationships
between sentences and uses these relationships to identify the importance of each sentence. TextRank is an
unsupervised method that has been used in many previous works on summarization, either as the main or a
baseline method, because of its proven effectiveness [8]–[11]. Previously, many studies have also been
conducted to enhance the performance of the TextRank algorithm for text summarization or keyword
extraction, such as using word embedding [12]–[18], term frequency-inverse document frequency (TF-IDF)
[19], [20], the combination of 1-gram, 2-gram, and hidden Markov models [21], knowledge graph sentence
embedding and K-means clustering [22], statistical and linguistic features for sentence weighting [23],
variations of sentence similarity functions [24], and fine-tuning the hyperparameters [13].
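The graph-based scoring step of TextRank can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes sentences are already encoded as vectors, builds a cosine-similarity graph, and runs a standard damped power iteration (PageRank-style) to score sentences.

```python
import numpy as np

def textrank_scores(sentence_vectors, damping=0.85, iters=50):
    """Score sentences by power iteration over a cosine-similarity graph."""
    V = np.array(sentence_vectors, dtype=float)
    n = len(V)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    V = V / np.where(norms == 0, 1, norms)       # unit-normalize each sentence vector
    sim = V @ V.T                                # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)                   # no self-loops in the graph
    row_sums = sim.sum(axis=1, keepdims=True)
    M = sim / np.where(row_sums == 0, 1, row_sums)  # row-stochastic transition matrix
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * (M.T @ scores)
    return scores

# Toy example: the second sentence is similar to both others, so it scores highest.
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = textrank_scores(vecs)
```

The final summary is then built by taking the top-scoring sentences, so the quality of the sentence vectors directly determines the quality of the graph and the ranking.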
In this work, we propose to use weighted word embedding, based on a variety of word embedding
models and TF-IDF weighting, to enhance the effectiveness of the TextRank algorithm. Word embedding is
used to obtain a better word representation that can capture semantic information, resulting in better sentence
representation. Then, TF-IDF is used to weight the generated word embedding to further improve sentence
representation; this is based on our intuition that more important words, as estimated using corpus
statistics, should be valued more when generating the vector representation of sentences. TF-IDF is chosen
because it has been shown to perform well for term weighting in various natural language processing tasks
[19], [20], [25]–[27], and it has also been proven to significantly outperform the bag-of-words (BoW)
technique [28]. The combination of word embedding and TF-IDF weighting is then expected to
improve the estimation of sentence relationships in the TextRank algorithm. We use two traditional word
embedding models, i.e., Word2Vec [29] and FastText [30], and one contextualized word embedding model,
i.e., bidirectional encoder representations from transformers (BERT) [31], which has been pre-trained on a
large-scale Indonesian corpus and is referred to as the Indonesian BERT (IndoBERT)-based model [32].
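As a sketch of this idea, the snippet below forms a sentence vector as a TF-IDF-weighted average of word vectors. The tiny corpus, the toy two-dimensional word vectors, and the exact TF and IDF formulas are illustrative assumptions, not the paper's actual models or implementation.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Per-document TF-IDF: tf = count/len(doc), idf = log(N / df)."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                       # document frequency per term
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({w: (tf[w] / len(doc)) * math.log(N / df[w]) for w in tf})
    return out

def weighted_sentence_vector(sentence, embeddings, weights, dim=2):
    """TF-IDF-weighted average of the word vectors in a sentence."""
    vec, total = [0.0] * dim, 0.0
    for w in sentence:
        wt = weights.get(w, 0.0)
        if w in embeddings and wt > 0.0:
            total += wt
            for i in range(dim):
                vec[i] += wt * embeddings[w][i]
    return [v / total for v in vec] if total > 0 else vec

# Toy 2-D "embeddings" and a 3-document corpus (all assumed for illustration).
embeddings = {"stock": [1, 0], "market": [1, 0], "falls": [0, 1],
              "rises": [0, 1], "rain": [0, 1]}
docs = [["stock", "market", "falls"], ["stock", "market", "rises"], ["rain", "falls"]]
weights = tfidf_weights(docs)
vec0 = weighted_sentence_vector(docs[0], embeddings, weights[0])
```

Words with higher TF-IDF pull the sentence vector toward their embeddings, so corpus-wide common words contribute less to the sentence representation.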
We divide previous works on text summarization that are related to ours into three groups: first,
works that used word embedding in TextRank; second, works that used TF-IDF weighting in TextRank; and
third, works that used weighted word embedding. They are explained in more detail in the following three
paragraphs. To the best of our knowledge, none of the previous works has incorporated weighted word
embedding in the TextRank summarization algorithm. In addition, none of them has investigated the use of
contextualized word embedding from IndoBERT in TextRank for text summarization. These are the research
gaps that will be filled in this work.
Much attention has been given to the use of word embedding in the TextRank algorithm for text
summarization [12]–[16]. These works use different kinds of word embeddings, such as Global Vectors for
word representation (GloVe) [12], [14], [17], Word2Vec [13], [14], FastText [14], and sentence-BERT
(SBERT) [15], [16]. None of these works, however, used weighted word embedding, so the embeddings are
not weighted according to collection statistics. In addition, they did not investigate the contextualized
word embedding from IndoBERT, which is one of the word embeddings studied in this work to be
incorporated with TextRank. These points differentiate our work from theirs.
Using TextRank and TF-IDF to generate extractive summaries from documents has been explored
by relatively few studies [19], [20]. Zaware et al. [19] used TF-IDF as the vector representation of sentences
to be input into the TextRank algorithm. More specifically, a sentence is represented by an n-dimensional
vector containing the TF-IDF scores of words. Different from them, Guan et al. [20] did not combine
TF-IDF and TextRank, but used them separately to extract keywords from the text and from the comments
given to the text, respectively. The resulting keywords were then used to score the sentences.
In contrast to Zaware et al. [19] and Guan et al. [20], we use TF-IDF to weight the word embedding to
improve the vector representation of sentences.
Very few works have investigated weighted word embedding for summarization [11], [33], and they
did not use it with the TextRank algorithm. Rani and Labiyal [11] used weighted word embedding, combined
with a set of linguistic and statistical features, to rank sentences in the clusters generated using the K-means
algorithm. You et al. [33] explored the use of IDF weights and Word2Vec embedding as one of the
reconstruction mechanisms in their sequence-to-sequence summarization model. Our work differs from theirs
in that we incorporate the weighted word embedding into the TextRank summarization algorithm.
Our research questions in this work are as follows: i) how effective is the use of word embeddings
from Word2Vec, FastText, and BERT in the TextRank algorithm for extracting text summaries from an
Indonesian dataset? and ii) to what extent does the performance of the summarization systems differ
when the word embeddings are weighted using TF-IDF? The rest of the paper is organized as follows.
Section 2 describes the dataset, system framework, and evaluation method. Section 3 presents our
experimental results together with the analysis. Finally, section 4 concludes our study.
2. METHOD
2.1. Dataset
This work uses the Liputan6 dataset [9], a large-scale Indonesian dataset for text summarization. The
documents in this dataset come from news articles on the Indonesian online news portal Liputan6. It covers a
variety of topics, such as politics, business, and technology. This dataset was automatically built by
Koto et al. [9] by extracting the short description contained in the webpage metadata of each news article on the
Liputan6 website as the ground truth summary for the article. We use the Canonical set of the Liputan6 dataset,
which is the complete version of this dataset, to obtain more comprehensive results in our experiment. Since the
main summarization method used in this work, the TextRank algorithm, is unsupervised, we only use the
test split of the dataset, which consists of 10,972 documents. The statistics of this dataset are presented in Table 1.
Table 1. Statistics of our dataset
Characteristic Statistic
Total documents 10,972 documents
Average length of document (in sentences) 11.71 sentences
Average length of document (in words) 218.55 words
Average length of document sentence 18.67 words
Average length of ground truth summary (in sentences) 2.06 sentences
Average length of ground truth summary (in words) 26.20 words
It appears from Table 1 that the documents in our dataset are short to moderate in length, at
around 12 sentences per document. The ground truth summaries are also relatively short, with 2 sentences on
average, ranging from 1 to 5 sentences. For evaluation purposes, the length of the document summaries
generated by our summarization system will vary, following the length of the ground truth summaries for the
corresponding documents. For example, if the ground truth summary for a document has 3 sentences, then our
summarization system will also produce a 3-sentence summary for that document.
One of the conveniences provided by the Liputan6 dataset is that each news document has already
been split into sentences, and each sentence has also been tokenized into words. This makes the data easier
to preprocess; the only preprocessing we performed on the dataset is case-folding. Figure 1
illustrates an example of a document in the Liputan6 dataset that is used in this work (see the right column of
the table). We can see that the documents in the Liputan6 dataset are in the form of a list of sentences, where
each sentence is in the form of a list of words. The dataset also does not include the document titles. The original
document is presented in the left column of the table. Note that the English translation is not a part of the
dataset, but we display it in the table to help readers understand the content of this document. This document
example consists of nine sentences, and its ground truth summary consists of two sentences.
Figure 1. A sample of text in Liputan6 dataset (right) and its original text in the Liputan6 website (left)
Enhanced TextRank using weighted word embedding for text summarization (Evi Yulianti)
We also use another dataset to train the embedding models from which the word embeddings are
extracted. A detailed explanation of the word embedding methods used in this work is given in section
2.2.1. To train the Word2Vec and FastText embedding models, we use a dump of the Indonesian Wikipedia of
September 20th, 2022 (idwiki-20220920-pages-articles.xml.bz2), which consists of 3,309,595 pages (according
to the page count statistic presented in idwiki-20220920-site stats.sql.gz) or 135,678,726 words. For BERT
embedding, we use a pre-trained BERT language model [32] that was trained on a large-scale Indonesian corpus,
i.e., the Indo4B dataset, which consists of a variety of data, such as Wikipedia pages, news articles, and tweets, with
a total of 3,581,301,476 words. This model was called IndoBERT in Wilie et al. [32].
2.2. System framework
Figure 2 illustrates the flow of the process in our summarization system. Initially, embedding
models are learned using a large corpus. For the Word2Vec and FastText models, we train them using the Indonesian
Wikipedia dataset. For the BERT model, we use an Indonesian BERT-based pre-trained model from previous
work, i.e., IndoBERT [32].
Figure 2. The framework of our enhanced TextRank summarization system
When a document to be summarized is inputted into our system, it first needs to be split into
sentences (s). Then, a word embedding for each word in the sentences is generated from the embedding
models. The red box in the figure highlights a slightly different process in generating word embeddings using
the Word2Vec/FastText and BERT models. The dashed blue arrows inside the red box indicate conditional flow:
we follow the top dashed arrow to generate word embeddings using Word2Vec/FastText, and
the bottom dashed arrow to generate BERT word embeddings.
Because Word2Vec and FastText are traditional word embeddings, there is only one static vector
representation for each word. Therefore, each sentence needs to be split into words first. Then, given a word (𝑤),
the Word2Vec/FastText model will generate a word embedding or word vector (𝑤𝑣). This is different from
BERT, which is a contextualized word embedding, so it considers context from the right and left word sequences
when computing a word embedding. As a result, the model will give different embeddings for a word depending
on its context; therefore, the word embedding for any word is not static. Since the model needs context, we
directly input a sentence into the model to produce word embeddings for each word in the sentence.
Further, TF-IDF weighting is computed using our Liputan6 dataset to assign a weight to each word
based on its occurrence statistics in the dataset. The TF-IDF scores are used to weigh the word embeddings
generated earlier to produce the weighted word embeddings or weighted word vectors (𝑤𝑤𝑣). Next,
sentence vectors (𝑠𝑣) are formed by taking the average of the weighted word vectors that compose the sentences.
After sentence vectors are obtained, the cosine similarity between any two sentences in the document is
computed to build a sentence similarity matrix. The TextRank summarization algorithm is then applied to score
the sentences based on their importance among all other sentences in the document using a graph-based
method. To generate a summary, the top-k sentences with the highest scores are extracted and then ordered
according to the order of the sentences in the document.
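The construction of sentence vectors and the similarity matrix described above can be sketched as follows. This is a minimal illustration with toy 2-dimensional embeddings and invented TF-IDF weights; the helper names and values are hypothetical, not the system's actual code.

```python
import math

def sentence_vector(words, embeddings, tfidf):
    """Sentence vector: average of the TF-IDF-weighted word vectors."""
    dim = len(next(iter(embeddings.values())))
    sv = [0.0] * dim
    for w in words:
        wv = embeddings[w]      # static word vector for w
        weight = tfidf[w]       # TF-IDF weight for w
        for i in range(dim):
            sv[i] += weight * wv[i]
    return [x / len(words) for x in sv]

def cosine_sim(a, b):
    """Cosine similarity between two sentence vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 2-dimensional embeddings and TF-IDF weights (illustrative values only).
emb = {"rakernas": [1.0, 0.0], "mpr": [0.8, 0.6], "politik": [0.0, 1.0]}
tfidf = {"rakernas": 0.9, "mpr": 0.5, "politik": 0.7}

s1 = sentence_vector(["rakernas", "mpr"], emb, tfidf)
s2 = sentence_vector(["politik", "mpr"], emb, tfidf)
sim = cosine_sim(s1, s2)  # one entry of the sentence similarity matrix
```

The full matrix of such pairwise similarities forms the weighted sentence graph that TextRank (section 2.2.3) then scores.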
In general, our summarization system contains three main components: word embedding,
TF-IDF weighting, and the TextRank summarization algorithm. A more detailed explanation of these
components is given in the following subsections.
2.2.1. Word embedding
The concept of word embedding was initially introduced by Bengio et al. [34] as a distributed
representation of words, obtained after jointly training word embeddings with a model's parameters in a
neural network model, that can preserve the semantic and syntactic relationships between words. Word
embedding represents a word as a vector learned from a large collection of data, in which words that have
similar meanings (semantically similar) will have vectors that are close in the vector space. Word embedding
is generally more effective than the classic bag-of-words vector representation because of its ability to capture
the semantic relationships between words. In addition, since word embedding is a dense representation, it
also solves the sparsity problem of the classic representation, which makes the computation more effective
and efficient. Therefore, most of the current research in text processing uses word embedding as the word
representation [35]–[38]. There are three variants of word embedding that are explored in this work:
Word2Vec, FastText, and BERT. They are detailed in the following.
a) Word2Vec
Mikolov et al. [29] introduced two neural architectures to learn Word2Vec word embeddings: the
continuous skip-gram model and the continuous bag of words (CBOW) model. Both models learn vector
representations of words that can capture semantic information as well as linguistic regularities and patterns.
CBOW works on the task of predicting the current word 𝑤(𝑡) based on the context words 𝑤(𝑡 ± 𝑖), where
𝑖 = 1, 2, 3, …, 𝑛, with 𝑛 denoting the window size. Conversely, skip-gram works on the task of predicting the
context words 𝑤(𝑡 ± 𝑖) based on the current word 𝑤(𝑡). The window size is one of the hyperparameters of the
Word2Vec algorithm, which can be set during the training process, and indicates the maximum distance
between the current and predicted words within a sentence. The weights of the neural network learned during
the training process serve as the elements of the word embedding. Mikolov et al. [29] found that the skip-gram
model can capture semantic relationships better than the CBOW model. Therefore, the skip-gram
model is chosen in this work to build the Word2Vec model.
We use the Python library Gensim to implement the Word2Vec algorithm and train the corresponding
embedding model using the Indonesian Wikipedia dataset. We use Gensim's default values for the
hyperparameters to build the Word2Vec model, as follows: learning 𝑟𝑎𝑡𝑒 = 0.025, 𝑒𝑝𝑜𝑐ℎ𝑠 = 5, and
window 𝑠𝑖𝑧𝑒 = 5. We train the Word2Vec model using the Indonesian Wikipedia dataset because it is a large
dataset that is publicly available. For the same reason, many previous researchers also trained their
Word2Vec models using the Wikipedia dataset [39], [40].
b) FastText
FastText is an extension of the Word2Vec embedding proposed by Bojanowski et al. [30] that takes into
account the morphology of words (i.e., subwords). In the original Word2Vec embedding, every word has its
own word embedding that is generated without considering similarity in word morphology. So, it is
possible that two morphologically similar words do not have similar word representations. This is considered
a drawback of Word2Vec for a language that is rich in morphology [30]. FastText operates at a more
granular level, namely the character n-gram level, while Word2Vec works at the word level. A character
n-gram is a sequence of n characters within a given character window. In FastText, a word is represented by
the sum of its character n-grams; therefore, its word embedding is obtained by summing up the vector
representations of its character n-grams. For example, the Indonesian word “melihat” (English translation: “see”),
with a window size of 3, will be split up into seven subwords (character 3-grams): “<me”, “mel”,
“eli”, “lih”, “iha”, “hat”, “at>”, and the word embedding for that word is the sum of the vectors for those seven
subwords.
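The subword decomposition above can be sketched as follows. This is a simplified illustration of the character n-gram extraction (FastText's actual implementation also adds the full word itself as a special token):

```python
def char_ngrams(word, n=3):
    """Split a word into its character n-grams, using '<' and '>'
    as word-boundary markers, as in FastText's subword model."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

subwords = char_ngrams("melihat")
# yields the seven 3-grams: <me, mel, eli, lih, iha, hat, at>
```

The word's embedding is then the sum of the learned vectors for these subwords, which is also how embeddings for out-of-vocabulary words are inferred.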
Using the above mechanism, FastText can better learn representations for morphologically rich languages.
Words that have similar morphology (sharing many overlapping character n-grams) will be closer in the
vector space. In addition, it also enables rare words to be represented appropriately by taking the summation
of the embeddings of their subwords. As a result, we can infer the embeddings of unseen words that do not exist in
the training corpus, which therefore helps to tackle the out-of-vocabulary (OOV) issue. Similar to the Word2Vec
implementation, we also use the Python library Gensim and the skip-gram architecture to implement the FastText
algorithm and train the corresponding embedding model using the Indonesian Wikipedia dataset.
c) Bidirectional encoder representations from transformers
BERT is a contextualized language model that was first introduced by Devlin et al. [31]. The BERT
architecture contains a stack of encoders from the transformer model [41] that learns the word context of a
language in a bidirectional way. While Word2Vec and FastText belong to the traditional word embeddings, in
which one word will always have one embedding no matter what words exist before or after it, BERT
belongs to the contextualized word embeddings, in which the word embedding is assigned by looking at the
context around that word. Therefore, it is designed to better understand the contextual relationships between
words. The BERT model can be fine-tuned to directly solve downstream tasks. In this work, however, BERT
is only used in a feature-based approach to extract word embeddings to be incorporated with the TextRank
algorithm. The performance will then be compared against the Word2Vec and FastText models.
Recently, BERT was pre-trained on a large-scale Indonesian corpus consisting of around 4 billion
words, referred to as IndoBERT [32]. IndoBERT has some variants that differ in the model architecture, such
as the number of encoder layers, the hidden size, and the number of attention heads. Two main
variants of IndoBERT that are investigated in this work are IndoBERTbase and IndoBERTlarge. Table 2
highlights the differences in architecture between IndoBERTbase and IndoBERTlarge.
Table 2. The architecture of IndoBERTbase and IndoBERTlarge models
IndoBERT Model #Layers Hidden size #Attention heads Vocab size #Parameters
IndoBERTbase 12 768 12 30,522 124.5M
IndoBERTlarge 24 1,024 16 30,522 335.2M
It appears from the table that IndoBERTlarge has twice as many encoder layers and almost three times as
many neural network parameters as IndoBERTbase. To examine to what extent this difference affects the
effectiveness of the resulting word embeddings, we compare the results of our summarization system using
contextualized embeddings from these two architectures. To use the pre-trained IndoBERT models, we use the
transformers library from the Hugging Face website, which provides application programming interfaces (APIs)
and tools for downloading and training some state-of-the-art pre-trained language models.
2.2.2. TF-IDF weighting
TF-IDF is an attempt to give weight to words contained in a text based on their occurrence in the
collection. In this work, TF-IDF score is used to weigh the word embeddings (word vectors). In general,
TF-IDF estimates how important words are in a document, while also considering their importance in the
collection. The formula to compute TF and IDF scores are (1) and (2),
𝑡𝑓(𝑡, 𝑑) =
𝑓(𝑡,𝑑)
∑ 𝑓(𝑡
̂,𝑑)
𝑡
̂∈𝑑
(1)
𝑖𝑑𝑓(𝑡) =
|𝐶|
|𝑑∈𝐶:𝑡∈𝑑|
(2)
where $f(t,d)$ represents the term frequency (TF) of a word $t$ in a document $d$, and $\sum_{\hat{t} \in d} f(\hat{t},d)$ indicates the
document length, i.e., the total number of terms in the document $d$. The higher the number of occurrences of a term in a
document, the more important that word is in the document. Then, $idf(t)$ represents the inverse document frequency
(IDF) of a word $t$ in the collection $C$, where $|C|$ is the number of documents in the collection and $|\{d \in C : t \in d\}|$ is
the number of documents in the collection that contain the word $t$. The more documents in a collection that contain a
particular word, the more general that word is. To sum up, words that often appear in a document but do not often
appear in the corpus will be assigned a high TF-IDF weight, indicating that they are important words.
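Equations (1) and (2) can be sketched directly in code. This is a toy illustration on a three-document collection; note that, following the paper's formula, no logarithm is applied to the IDF term.

```python
def tf(term, doc):
    """Term frequency, equation (1): count of `term` in `doc`
    divided by the document length (total tokens)."""
    return doc.count(term) / len(doc)

def idf(term, collection):
    """Inverse document frequency, equation (2): number of documents
    in the collection divided by the number of documents containing
    `term` (no logarithm, as in the paper's formula)."""
    containing = sum(1 for doc in collection if term in doc)
    return len(collection) / containing

# Toy collection of tokenized documents.
docs = [
    ["rakernas", "membahas", "politik", "politik"],
    ["sidang", "istimewa", "mpr"],
    ["rakernas", "mpr", "politik"],
]

# "politik" occurs twice in a 4-token document and in 2 of 3 documents:
# tf = 2/4 = 0.5, idf = 3/2 = 1.5, so tf-idf = 0.75.
score = tf("politik", docs[0]) * idf("politik", docs)
```

In the system, such scores are computed over the Liputan6 dataset and used as the per-word weights on the word vectors.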
2.2.3. TextRank
TextRank is a graph-based ranking algorithm for scoring text [8], where the text unit can be
specified, such as keywords or sentences. TextRank is an extension of PageRank [42], which was originally
aimed at webpage ranking. In this paper, TextRank is used for sentence ranking purposes. The relationship
between sentences in the document is a factor that contributes to the score given by TextRank to each
sentence. To calculate the relationship between sentences, we use cosine similarity, following
Barrios et al. [24]. The resulting pairwise cosine similarity scores between sentences are then used to
construct a similarity matrix that serves as a basis to form a weighted graph of sentences.
In the weighted graph representation, a vertex represents a sentence, and an edge between two
vertexes represents the relationship/connection between two sentences. The similarity score is used as the
weight of the edge between those two sentences. Two sentences that are more similar will have a higher
weight on the edge connecting them. Based on this graph representation, we then apply the TextRank algorithm
to score each sentence in the document using (3),
$S(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \cdot S(V_j)$  (3)

where $V_i$ is the $i$-th vertex and $S(V_i)$ denotes the score for the $i$-th vertex, which basically represents the
sentence score. $In(V_i)$ and $Out(V_j)$ respectively denote the set of vertexes that go into $V_i$ and the set of
vertexes that go out from $V_j$. $d$ is a damping factor, which is the probability of randomly moving from one
vertex to another vertex in the graph. $w_{ji}$ and $w_{jk}$ respectively denote the weight of the edge between sentences
$j$ and $i$, and the weight of the edge between sentences $j$ and $k$.
TextRank is an iterative algorithm that stops iterating when convergence is achieved, i.e., when the
difference between the scores of the sentences in two successive iterations is already very small, so we can
assume that the scores will not change significantly in subsequent iterations. When convergence is achieved,
all sentences in a document have final scores. We then take the N sentences with the highest scores as a summary.
Since the length of the summaries in our ground truth dataset varies, N is adjusted to the actual length of the
summary for a document. In our implementation, we use the Python library sklearn to compute the cosine
similarity between sentences. Then, the Python library NetworkX is used to implement the TextRank algorithm.
The damping factor is set to 0.85, following the implementations of the original PageRank [42] and TextRank [8].
For the original TextRank that is used as a baseline method in our experiments, the word representation uses a
bag-of-words vector as in the original paper [8]. For our summarization methods, we use word embeddings from
Word2Vec, FastText, and IndoBERT as word representations.
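The iteration in (3) can be sketched as follows on a toy three-sentence similarity matrix (the actual system delegates this to NetworkX); the matrix values are purely illustrative.

```python
def textrank(sim, d=0.85, tol=1e-6, max_iter=100):
    """Iterate equation (3) on a symmetric sentence-similarity matrix
    until every score changes by less than `tol` between iterations."""
    n = len(sim)
    scores = [1.0] * n
    # Each vertex's total outgoing edge weight (denominator in eq. (3)).
    out_weight = [sum(row) for row in sim]
    for _ in range(max_iter):
        new = []
        for i in range(n):
            rank = sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if j != i and out_weight[j] > 0
            )
            new.append((1 - d) + d * rank)
        converged = all(abs(a - b) < tol for a, b in zip(new, scores))
        scores = new
        if converged:
            break
    return scores

# Toy cosine-similarity matrix for three sentences: sentence 0 is the
# most strongly connected, so it should receive the highest score.
sim = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
scores = textrank(sim)
best = max(range(3), key=lambda i: scores[i])  # top-ranked sentence index
```

The top-N scored sentences are then reordered by their positions in the document to form the summary.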
2.3. Evaluation method
The evaluation of our experiment results is performed using the recall-oriented understudy for gisting
evaluation (ROUGE) metric [43], which is a common metric for extractive text summarization. In general,
ROUGE calculates the word overlap between automatic summaries and ground truth summaries. The types of
ROUGE used are ROUGE-N and ROUGE-L. While ROUGE-N counts the number of overlapping n-grams
between two summaries, ROUGE-L counts the longest common subsequence (LCS) between two summaries.
We choose the F1 score as the measurement for our ROUGE score calculation since it considers
both precision and recall. The Python library rouge_score is used to calculate the ROUGE scores for our
summaries.
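As a simplified illustration of the ROUGE-1 F1 computation (unigram overlap only; the actual rouge_score library additionally handles tokenization details and optional stemming):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between a
    candidate summary and a reference summary (both token lists)."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # clipped match count
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 of 5 unigrams overlap, so precision = recall = 0.6.
ref = ["rakernas", "membahas", "sidang", "istimewa", "mpr"]
cand = ["rakernas", "membahas", "situasi", "politik", "mpr"]
score = rouge1_f1(cand, ref)
```

ROUGE-2 follows the same scheme over bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.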
3. RESULTS AND DISCUSSION
3.1. The effectiveness of enhanced TextRank using (unweighted) word embedding
Table 3 shows the performance of our TextRank-based summarization systems using
(unweighted) word embeddings from Word2Vec, FastText, and IndoBERT. These systems are compared to
the original TextRank method as the baseline. We can see that the ROUGE scores of all systems using word
embeddings are higher than those of the baseline system. Here, the use of word embedding is shown to significantly
increase the performance of the original TextRank algorithm.
Table 3. Performance comparison between systems using (unweighted) word embedding
Summarization System ROUGE-1 ROUGE-2 ROUGE-L
TextRank 0.3479 0.2339 0.3302
TextRank+Word2Vec 0.3822∗ 0.2693∗ 0.3646∗
TextRank+FastText 0.3776∗ 0.2644∗ 0.3598∗
TextRank+IndoBERTbase 0.3929∗+× 0.2837∗+× 0.3768∗+×
TextRank+IndoBERTlarge 0.3940∗+× 0.2854∗+× 0.3779∗+×
Symbols *, +, ×, and ÷ denote significant differences against TextRank, TextRank + Word2Vec,
TextRank + FastText, and TextRank + IndoBERTbase methods, respectively, according to the
paired t-test (p <0.05).
It also appears from the table that the systems using IndoBERT are significantly more effective than
those using Word2Vec and FastText. This confirms one of our contributions in this work, which is using the
contextualized word embedding BERT in the TextRank algorithm. The advantage possessed by BERT occurs
because its word embeddings are generated by looking at the context of the surrounding words (this is
different from the Word2Vec and FastText models, which adopt static word representations as they do not
consider the context when generating word embeddings). As a result, the word embeddings from BERT capture
semantic relationships between words more accurately. A better word representation then results in a better
sentence representation, which further contributes to a better estimate of sentence importance by the
TextRank method. This explains why BERT embedding enhances the performance of the TextRank method
the most.
TextRank using IndoBERTlarge gains slightly higher ROUGE scores than that using
IndoBERTbase, and this result is consistent with the findings reported in previous work [44]. The statistical
test, however, shows that this difference is not significant. This slight superiority of IndoBERTlarge over
IndoBERTbase is attributable to the higher number of layers and parameters in its neural model,
which makes it slightly better at capturing the meanings of words.
3.2. The effectiveness of enhanced TextRank using weighted word embedding
Table 4 shows the performance of the enhanced TextRank methods using weighted word
embedding. Compared to the results using unweighted word embedding presented in Table 3 earlier, it is
clear that applying TF-IDF weighting to the word embeddings gives significant performance
increases to all summarization systems. This increase can occur because, by using weighted word embedding,
the importance of words in the collection is considered when constructing the sentence representation.
Consequently, it improves the generated sentence vectors, which in turn enhances the ability of the
TextRank method to estimate the importance of sentences.
Table 4. Performance comparison between systems using weighted word embedding
Summarization System ROUGE-1 ROUGE-2 ROUGE-L
TextRank 0.3479 0.2339 0.3302
TextRank+Word2Vec+TF-IDF 0.4082‡∗ 0.3041‡∗ 0.3926‡∗
TextRank+FastText+TF-IDF 0.4042‡∗ 0.2982‡∗ 0.3879‡∗
TextRank+IndoBERTbase+TF-IDF 0.4039‡∗ 0.2974‡∗ 0.3880‡∗
TextRank+IndoBERTlarge+TF-IDF 0.4044‡∗ 0.2982‡∗ 0.3884‡∗
Symbol ‡ illustrates the significant difference against the corresponding methods without using TF-IDF
weighting, as displayed in Table 3. Then, symbols *, +, ×, and ÷ denote significant differences against
TextRank, TextRank + Word2Vec + TF-IDF, TextRank + FastText + TF-IDF, and TextRank + IndoBERT-
based-methods + TF-IDF, respectively, according to the paired t-test (p <0.05)
The performance increases over the systems without weighted word embedding are
2.64% to 7.04% in ROUGE-1, 4.48% to 12.92% in ROUGE-2, and 2.78% to 7.81% in ROUGE-L. The systems that
benefit the most from the TF-IDF weighting are those using the Word2Vec and FastText models, which
respectively achieve 12.92% and 12.78% increases in ROUGE-2. The systems using the IndoBERTbase and
IndoBERTlarge models only gain 4.83% and 4.48% increases in ROUGE-2, respectively. Here, we can see
that the performance increases obtained by the systems using the Word2Vec and FastText models are almost
three times higher than those using the BERT models. We analyze that this could happen because the
TF-IDF weighting helps reduce the limitation of the static word embeddings generated by Word2Vec and
FastText, which do not see the contextual meaning of a word; the TF-IDF weighting adds
information on the importance of words in the collection to the word representation. In addition,
TF-IDF works in almost the same way as the word embeddings from Word2Vec and FastText, where each word
has only one TF-IDF weight no matter what words occur before or after it, as it does not consider the context.
Consequently, it is more effective for Word2Vec and FastText than for BERT.
3.3. More about the summary results
We analyze some examples of the summaries generated using our systems and compare them with
the ground truth summaries. Table 5 shows an example of the summaries generated for the document
with id “16379” in our dataset using TextRank and (unweighted/weighted) word embeddings based on
Word2Vec, FastText, and BERT. The ground truth summary as well as the baseline summary (TextRank) for
this document are also displayed in the table for comparison. The text printed in boldface indicates text
that exists in the ground truth summary. The more text in an automatically generated summary is bolded,
the better the summary is, since it is more similar to the ground truth summary.
It appears that the summary generated using the baseline method contains only one sentence (out of
two sentences) from the ground truth summary. The results are the same for the TextRank+Word2Vec and
TextRank+FastText systems. In this case, the use of unweighted word embeddings from Word2Vec and
FastText is unable to improve the baseline method, i.e., TextRank.
However, after adding the TF-IDF weighting to the word embeddings from Word2Vec and
FastText, these systems succeed in enhancing the TextRank method, which enables the summaries they
generate to be the same as the ground truth summary. This example demonstrates that weighted word
embedding is effective for Word2Vec and FastText, as it can improve the performance of unweighted word
embedding. Meanwhile, the use of the IndoBERTbase and IndoBERTlarge word embeddings, even without
TF-IDF weighting, succeeds in producing summaries that are the same as the ground truth summary.
Further using the weighted word embedding does not change the summary results for the IndoBERT-based
systems. This confirms our results in the earlier subsection: since IndoBERT considers the context to
produce word embeddings, the original word embeddings can already capture the semantics of
words well. Therefore, the effect of TF-IDF weighting for this method is smaller compared to that for the
static word embeddings (Word2Vec and FastText).
Table 5. Example of summaries for the document with id “16379”
System Summaries
Ground Truth Rakernas juga akan meminta penjelasan Fraksi PDI-P di DPR tentang proses Memorandum I, II,
hingga menjelang SI MPR. Rakernas juga akan menanyakan kesiapan F-PDIP MPR menghadapi
sidang istimewa.
(English translation: The National Working Meeting will also ask for an explanation from the PDI-P
faction in the People’s Representative Council regarding the process of Memorandums I, II, up to
the time before the people’s consultative assembly (PCA) Special Session. The National Working
Meeting will also ask about the PCA’s F-PDIP readiness for a special session.)
TextRank (baseline) Di antaranya, membicarakan soal situasi politik terkini, termasuk pergelaran Sidang Istimewa
MPR, 1 Agustus mendatang. Rakernas juga akan meminta penjelasan Fraksi PDI-P di DPR tentang
proses Memorandum I, II, hingga menjelang SI MPR.
(ROUGE-1=0.6875, ROUGE-2=0.6129, ROUGE-L=0.6875)
(English translation: Among other things, discussing the current political situation, including the
holding of the MPR Special Session, 1 August. The National Working Meeting will also ask for an
explanation from the PDI-P faction in the People’s Representative Council regarding the process of
Memorandums I, II, up to the time before the PCA Special Session.)
TextRank+Word2Vec same as the summary of TextRank above
TextRank+FastText same as the summary of TextRank above
TextRank+IndoBERTbase
Rakernas juga akan meminta penjelasan Fraksi PDI-P di DPR tentang proses Memorandum I, II,
hingga menjelang SI MPR. Rakernas juga akan menanyakan kesiapan F-PDIP MPR menghadapi
sidang istimewa.
(ROUGE-1=1, ROUGE-2=1, ROUGE-L=1)
(English translation: The National Working Meeting will also ask for an explanation from the PDI-P
faction in the People’s Representative Council regarding the process of Memorandums I, II, up to the
time before the PCA Special Session. The National Working Meeting will also ask about the PCA’s F-
PDIP readiness for a special session.)
TextRank+IndoBERTlarge same as the summary of TextRank+IndoBERTbase above
TextRank+Word2Vec+TF-IDF same as the summary of TextRank+IndoBERTbase above
TextRank+FastText+TF-IDF same as the summary of TextRank+IndoBERTbase above
TextRank+IndoBERTbase+TF-IDF same as the summary of TextRank+IndoBERTbase above
TextRank+IndoBERTlarge+TF-IDF same as the summary of TextRank+IndoBERTbase above
4. CONCLUSION
We propose to use weighted word embedding to enhance the TextRank algorithm for text
summarization on an Indonesian news dataset. A variety of word embeddings, namely traditional word
embeddings (Word2Vec and FastText) and the contextualized word embedding BERT, combined with TF-IDF
weighting, are incorporated with TextRank in our methods. The effects of the weighted
word embedding (as well as the unweighted word embedding) on the performance of the TextRank
summarization method are examined in this work.
Our experimental results show that the use of (unweighted) word embedding improves the
performance of the TextRank method by up to 13.25% in ROUGE-1 and up to 22.02% in ROUGE-2. The
system using word embedding from BERT is shown to achieve the highest performance increase. The BERT-based
summarization systems gain 5.98% and 7.94% higher ROUGE-2 scores as compared to the systems
using Word2Vec and FastText, respectively. This indicates that BERT word embedding, which is generated
by considering the context of the surrounding words, produces better word representations than Word2Vec
and FastText, which further results in better estimation of sentence importance in TextRank. When each
word embedding is weighted using the TF-IDF score, we found that the performance of all systems using
unweighted word embedding significantly improves. However, the BERT-based systems only gain a small
improvement (with a 2.64% to 4.83% increase), while the biggest improvement is achieved by the systems
using static word embeddings, i.e., Word2Vec (with a 6.80% to 12.92% increase) and FastText (with a 7.04% to
12.78% increase). Overall, our systems using weighted word embedding can outperform the original
TextRank method by up to 17.33% in ROUGE-1 and 30.01% in ROUGE-2. This shows that our proposed
methods succeed in giving a significant improvement over the original TextRank method.
ACKNOWLEDGEMENT
This research was funded by the Directorate of Research and Development, Universitas Indonesia,
under Hibah PUTI Q2 2022 (Grant No. NKB-571/UN2.RST/HKP.05.00/2022).
BIOGRAPHIES OF AUTHORS
Evi Yulianti is a lecturer and researcher at the Faculty of Computer Science,
Universitas Indonesia. She received the B.Comp.Sc. degree from Universitas Indonesia in
2010, the dual M.Comp.Sc. degree from Universitas Indonesia and Royal Melbourne Institute
of Technology University in 2013, and the Ph.D. degree from Royal Melbourne Institute of
Technology University in 2018. Her research interests include information retrieval and
natural language processing. She can be contacted at email evi.y@cs.ui.ac.id.
Nicholas Pangestu received his bachelor's degree in computer science from
Universitas Indonesia in 2023. His research interests are related to machine learning and
natural language processing. He can be contacted at email pangestu.nicholas@gmail.com.
Meganingrum Arista Jiwanggi is a lecturer and researcher at the Faculty of
Computer Science, Universitas Indonesia. She received her bachelor's degree in computer
science from Universitas Indonesia in 2012 and a dual master of computer science degree
from Universitas Indonesia and Royal Melbourne Institute of Technology University in
2014. Her research interests include natural language processing and data science. She can be
contacted at email meganingrum@cs.ui.ac.id.