The main focus of this research is to identify the reasons behind fresh COVID-19 cases from the public's perception, for data specific to India. The analysis is done using machine learning approaches, and the inferences are validated with medical professionals. The data processing and analysis are accomplished in three steps. First, the dimensionality of the vector space model (VSM) is reduced with an improvised feature engineering (FE) process that uses a weighted term frequency-inverse document frequency (TF-IDF) and forward scan trigrams (FST), followed by removal of weak features using a feature hashing technique. In the second step, an enhanced K-means clustering algorithm is used for grouping the public posts from Twitter. In the last step, latent Dirichlet allocation (LDA) is applied to discover the trigram topics relevant to the reasons behind the increase in fresh COVID-19 cases. The enhanced K-means clustering improved the Dunn index value by 18.11% compared with the traditional K-means method. By incorporating the improvised two-step FE process, the LDA model improved by 14% in terms of coherence score, and by 19% and 15% compared with latent semantic analysis (LSA) and the hierarchical Dirichlet process (HDP) respectively, resulting in 14 root causes for the spike in the disease.
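The two-step feature engineering the abstract describes (forward scan trigrams weighted by TF-IDF) can be sketched in miniature; the tokenization and weighting details below are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter

def forward_scan_trigrams(tokens):
    """Slide a window of size 3 left-to-right over the token list
    (a minimal reading of the forward scan trigram (FST) step)."""
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def tfidf(docs):
    """Weighted TF-IDF over trigram features: tf * log(N / df)."""
    grams = [Counter(forward_scan_trigrams(d.lower().split())) for d in docs]
    n = len(docs)
    df = Counter(g for c in grams for g in c)  # document frequency per trigram
    return [{g: tf * math.log(n / df[g]) for g, tf in c.items()} for c in grams]
```

A trigram that occurs in every document gets weight zero, which is the mechanism that suppresses weak, ubiquitous features before clustering.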
Enhancing COVID-19 forecasting through deep learning techniques and fine-tuning (IJECE/IAES)
This study presents a comprehensive analysis of classical linear regression forecasting models and deep learning techniques for predicting coronavirus disease 2019 (COVID-19) pandemic data. Among the deep learning models, the long short-term memory (LSTM) neural network demonstrated superior performance, delivering accurate predictions with minimal errors. The network effectively addressed overfitting and underfitting issues through rigorous tuning. However, the diversity of countries and dataset attributes posed challenges in achieving universally optimal predictions. The study also explored the application of the LSTM to predicting healthcare resource demand and optimizing hospital management, offering potential solutions for overcrowding and cost reduction. The results show the importance of leveraging advanced deep learning techniques for improved COVID-19 forecasting and of extending the models to broader healthcare challenges beyond the pandemic. To further enhance model performance, future work needs to incorporate additional attributes, such as vaccination rates and immunity percentages.
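Whatever the network architecture, an LSTM forecaster consumes the case-count series as fixed-length windows; a minimal sketch of that supervised framing (the lookback length is an illustrative choice, not the study's):

```python
def make_windows(series, lookback):
    """Frame a univariate series as supervised pairs: each sample is
    `lookback` past values, and the target is the next value."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    return X, y
```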
Coronavirus disease situation analysis and prediction using machine learning... (IJECE/IAES)
During a pandemic, early prognostication of infection rates can reduce deaths by ensuring treatment facilities and proper resource allocation. In recent months, death and infection rates have risen more sharply than before in Bangladesh, and the country is struggling to provide adequate medical treatment to many patients. This study compares machine learning models and builds a prediction system to anticipate infection and death rates for the coming days. Using a dataset covering March 1, 2020, to August 10, 2021, a multi-layer perceptron (MLP) model was trained. The data was collected from a trusted government website and prepared manually for training purposes. Several test cases determine the model's accuracy and prediction capability. The comparison between the models shows that the MLP has more reliable prediction capability than the support vector regression (SVR) and linear regression models. The model reports on the risk of an impending coronavirus disease (COVID-19) attack: according to its predictions, Bangladesh may suffer another COVID-19 wave, with between 929 and 2,443 infected cases and between 19 and 57 deaths.
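The linear-regression baseline the MLP is compared against can be written in closed form; a toy sketch on synthetic numbers, not the study's dataset:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x: the slope is the covariance of
    x and y over the variance of x, the intercept follows from the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b  # intercept, slope
```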
COVID-19 knowledge-based system for diagnosis in Iraq using IoT environment
The importance and benefits of healthcare mobile applications are increasing rapidly, especially when such applications are connected to the internet of things (IoT). This paper describes a smart knowledge-based system (KBS) that helps patients showing symptoms of influenza verify whether they are infected with coronavirus, commonly known as COVID-19. In addition to its diagnostic functionality, the system helps these patients get medical assistance quickly by notifying medical authorities through the IoT, displaying the patient's location, phone number, and the date and time of examination. During development, the developers used Twilio, short message service (SMS), WhatsApp, and Google Maps.
An agent-based model to assess coronavirus disease 19 spread and health syst... (IJECE/IAES)
The present pandemic has tremendously raised the health systems' burden around the globe. It is important to understand the transmission dynamics of the infection and impose localized strategies across different geographies to curtail its spread. The present study was designed to assess the transmission dynamics and the health systems' burden of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) using an agent-based modeling (ABM) approach. The study used a synthetic population with 31,738,240 agents representing 90.67 percent of the overall population of Telangana, India. The effects of imposing and lifting lockdowns, nonpharmaceutical interventions, and the role of immunity were analyzed. The distribution of people in different health states was measured separately for each district of Telangana. The spread dramatically increased and reached a peak soon after the lockdowns were relaxed. It was evident that the protection offered is higher when a higher proportion of the population is exposed to the interventions. ABMs help to analyze grassroots details compared to compartmental models. Risk estimates provide insights on the proportion of the population protected by the adoption of one or more of the control measures, which is of practical significance for policymaking.
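A drastically simplified agent-based S-I-R simulation conveys the mechanism; every parameter below is illustrative, not a calibrated value from the Telangana model:

```python
import random
from collections import Counter

def abm_sir(n=200, seeds=5, p_contact=0.01, days_ill=7, steps=60, seed=0):
    """Toy ABM: each day, every infectious agent independently reaches
    each susceptible agent with probability p_contact; after days_ill
    infectious days an agent recovers."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    state = ["I"] * seeds + ["S"] * (n - seeds)   # "S", "I", or "R"
    sick_days = [0] * n
    for _ in range(steps):
        infectious = [i for i, s in enumerate(state) if s == "I"]
        for i in infectious:
            for j in range(n):
                if state[j] == "S" and rng.random() < p_contact:
                    state[j] = "I"         # infected today, spreads tomorrow
            sick_days[i] += 1
            if sick_days[i] >= days_ill:
                state[i] = "R"
    return Counter(state)
```

Unlike a compartmental model, each agent here carries its own state and clock, which is what lets ABMs attach district, household, or behavior attributes at the grassroots level.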
Comprehensive study: machine learning approaches for COVID-19 diagnosis (IJECE/IAES)
Coronavirus disease 2019 (COVID-19) has caused a large number of deaths since it was declared an international pandemic in December 2019, and it has spread all over the world (to more than 200 countries). This situation puts health organizations under extraordinary pressure to develop effective smart solutions for early detection and monitoring, systems capable of identifying COVID-19 quickly and accurately. Nowadays, artificial intelligence (AI) and internet of things (IoT) techniques have an extensive range of applications and can offer a possible path to early detection and accurate decisions. We believe that combining the IoT revolution with machine learning (ML) methods can reshape healthcare treatment strategies to provide smart diagnosis, treatment, monitoring, and hospitals. This work aims to review the recent solutions that have been used for early detection and to provide researchers a comprehensive summary of techniques contributing to pandemic control, including AI, IoT, cloud, fog, and algorithms, together with recently published datasets and their sources. All models, frameworks, monitoring systems, devices, and ideas are presented in four sections with clarifications and justifications. We also propose a new vision for early detection based on IoT sensor data entry, using data from 1 million patients to verify three proposed methods.
An Overview on the Use of Data Mining and Linguistics Techniques for Building... (IJCSIT)
The usage of Online Social Networks (OSN), such as Facebook and Twitter, is becoming more and more popular for exchanging and disseminating news and information in real-time. Twitter in particular allows the instant dissemination of short messages in the form of microblogs to followers. This survey reviews the literature to examine how OSNs, such as the microblogging tool Twitter, can help in the detection of spreading epidemics. The paper highlights significant challenges in the field of Natural Language Processing (NLP) when using microblog-based early disease detection systems. For instance, microblogging data is an unstructured collection of short messages (140 characters in Twitter), with noise and non-standard use of the English language. Hence, research is currently exploring the field of linguistics to determine the semantics of the text, and uses data mining techniques to extract useful information for disease spread detection. Furthermore, the survey discusses applications and existing early disease detection systems based on OSNs, and outlines directions for future research on improving such systems based on a combination of linguistics methods, data mining techniques and recommendation systems.
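The noise and non-standard language the survey highlights are usually tackled with a normalization pass before any mining; a minimal sketch with illustrative rules:

```python
import re

def normalize_tweet(text):
    """Minimal microblog cleanup: strip URLs and @mentions, drop the '#'
    of hashtags (keeping the word), cap repeated-letter runs at two
    ("soooo" -> "soo"), and lowercase. The rules are illustrative,
    not any specific system's pipeline."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = text.replace("#", "")
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
    return " ".join(text.lower().split())
```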
Review of Interoperability techniques in data acquisition of wireless ECG dev... (IOSR-JCE)
Wireless medical devices with enhanced capability operating within a Body Area Network (BAN) are key elements in telehealth and remote health monitoring systems. They help to capture, process, and store vital signs acquired through sensors and transfer them through communication protocols such as ZigBee and Bluetooth to aggregators. However, due to the heterogeneity of devices and manufacturers' proprietary applications, the devices lack adequate interoperability with each other and with Health Information Systems (HIS), making it difficult to sort data and/or make it usable to the medical team. In this study we reviewed the various approaches employed to facilitate interoperability in wearable wireless electrocardiogram (ECG) devices, ranging from model-driven interoperability (information models, data models, etc.), ECG formats, and ontologies to standards and the mapping of physiological data, as reported in the literature over the last five years with respect to advances in telecommunication. The findings indicate that wearable wireless ECG needs both retrospective and anticipatory interoperability mechanisms, given advances in telecommunication technology and especially the promising fifth generation (5G) technology, to facilitate the transmission of the right information from the patient to the healthcare provider.
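Model-driven interoperability of the kind reviewed here amounts to mapping each vendor's fields onto a shared information model; a sketch in which all field names are hypothetical:

```python
def to_common_model(vendor_sample, field_map):
    """Map one vendor-specific ECG reading onto a shared information
    model via a per-device field map; fields the device does not
    report are simply absent from the result."""
    return {common: vendor_sample[vendor]
            for common, vendor in field_map.items()
            if vendor in vendor_sample}
```

One such map per device family is the retrospective half of the mechanism; anticipatory interoperability would extend the common model before new device generations ship.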
Analyzing sentiment dynamics from sparse text coronavirus disease-19 vaccina... (IJECE/IAES)
Social media platforms enable people to exchange their thoughts, reactions, and emotions regarding all aspects of their lives, so sentiment analysis using textual data is a widely practiced field. Due to the large textual content available on social media, sentiment analysis is usually treated as a text classification task, where the high feature dimension is an important issue that needs to be resolved by examining text meaningfully. The proposed study considers a case study of coronavirus (COVID) vaccination to draw conclusions about public opinion on the prospects for vaccination. A text corpus of tweets published between December 12, 2020, and July 13, 2021 was collected. The proposed model follows a phase-by-phase data analysis process, followed by an assessment of important information about the collected tweets on the coronavirus disease (COVID-19) vaccine using two sentiment analyzer methods and probabilistic models for validation and knowledge analysis. The results indicate that public sentiment is more positive than negative. The study also presents statistics on vaccination progress in the top countries from early 2021 to July 2021. The scope of the study is broad, covering sentiment analysis based on keyword and document modeling. The proposed work offers an effective mechanism for a decision-making system to understand public opinion, and accordingly assists policymakers in health measures and vaccination campaigns.
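At its simplest, a lexicon-based sentiment analyzer of the kind applied to such tweets counts polarity words; the mini-lexicon below is hypothetical, and real analyzers use far larger, weighted lexicons:

```python
# Hypothetical mini-lexicon for illustration only.
POS = {"safe", "effective", "hope", "grateful", "protected"}
NEG = {"fear", "risk", "side", "unsafe", "worried"}

def sentiment(text):
    """Count-based polarity: positive if more POS hits than NEG hits,
    negative if the reverse, else neutral."""
    words = text.lower().split()
    score = sum(w in POS for w in words) - sum(w in NEG for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```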
Assessment of the main features of the model of dissemination of information ... (IJECE/IAES)
Social networks provide a fairly wide range of data that allows the effect of information dissemination to be evaluated in one way or another. This article presents the results of a study describing methods for determining the key parameters of the model needed to analyze and predict the dissemination of information in social networks. An approach based on the analysis of statistical data on user behavior in social networks is proposed, and the process of evaluating the main features of the model is described, including the mathematical methods used for data analysis and information dissemination modeling. The study aims to understand the processes of information dissemination in social networks and develop recommendations for the effective use of social networks as a communication and brand promotion tool, as well as to consider the analytical properties of the classical susceptible-infected-removed (SIR) model and evaluate its applicability to the problem of information dissemination. The results of the study can be used to create algorithms and techniques for effectively managing the process of information dissemination in social networks.
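The classical SIR model the article evaluates can be integrated numerically in a few lines; the parameter values below are illustrative:

```python
def sir(s, i, r, beta, gamma, dt, steps):
    """Euler integration of the classical SIR equations
    dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I,
    with s, i, r as fractions of the population."""
    for _ in range(steps):
        new_inf = beta * s * i * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return s, i, r
```

For information dissemination, "infected" reads as "aware of the message" and "removed" as "no longer sharing it", which is exactly the reinterpretation whose applicability the study examines.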
Involving machine learning techniques in heart disease diagnosis: a performan... (IJECE/IAES)
Artificial intelligence is a science that is growing at a tremendous speed and has become an essential part of many domains, including medicine. Countless artificial intelligence applications can be seen in the medical domain at various levels, employed to enhance early diagnosis and prediction and to reduce the risks associated with many diseases, including heart disease. In this article, machine learning techniques (logistic regression, random forest, artificial neural network, support vector machines, and k-nearest neighbors) are utilized to diagnose heart disease from the Cleveland Clinic dataset obtained from the University of California Irvine (UCI) machine learning repository and the Kaggle platform, and the performance of these techniques is compared. In addition, literature on machine learning and deep learning techniques that aim to provide reasonable solutions for monitoring, detecting, diagnosing, and predicting heart disease, and on how these technologies assist in making health decisions, is reviewed: ten studies published between 2017 and 2022 are selected and summarized. After executing a series of tests, the best performance in diagnosing heart disease is achieved by the support vector machines, with a diagnostic accuracy of 96%. The article concludes that these techniques play a significant and influential role in assisting physicians and healthcare workers in analyzing heart patients' data, making health decisions, and saving patients' lives.
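One of the compared techniques, k-nearest neighbors, fits in a few lines; the toy data below is a stand-in for illustration, not the Cleveland dataset:

```python
def knn_predict(train, query, k=3):
    """Plain k-nearest neighbours on numeric feature vectors; `train` is
    a list of (features, label) pairs, and the majority label among the
    k closest (squared Euclidean distance) points wins."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)
```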
A comprehensive study on disease risk predictions in machine learning (IJECE/IAES)
Over recent years, multiple disease risk prediction models have been developed. These models use various patient characteristics to estimate the probability of outcomes over a certain period of time and hold the potential to improve decision making and individualize care. Discovering hidden patterns and interactions from medical databases, alongside growing evaluation of disease prediction models, has become crucial; traditional clinical findings require many trials, which can complicate disease prediction. A comprehensive study of different strategies used to predict disease is presented in this paper. Applying these techniques to healthcare data has improved risk prediction models for identifying the patients who would benefit from disease management programs, reducing hospital readmission and healthcare costs, but the results of these endeavors have varied.
Classification of a COVID-19 dataset by using labels created from clustering ...
Novel coronavirus (COVID-19) is a newly discovered infectious disease that has received much attention in the literature because of its rapid spread and the daily global deaths attributable to it. The White House, together with a coalition of leading research groups, has published the freely available COVID-19 Open Research Dataset to help the global research community apply the recent advances in natural language processing and other AI techniques in generating novel insights that can support the ongoing fight against this disease. In this paper, hierarchical and k-means clustering techniques are used to create a tool for identifying similar articles on COVID-19 and filtering them based on their titles. These articles are classified by applying three data mining techniques, namely random forest (RF), decision tree (DT) and bagging. By using this tool, specialists can limit the number of articles they need to study and pre-process these articles via data framing, tokenisation, normalisation and term frequency-inverse document frequency. The dimensionality of this dataset is reduced to two dimensions by applying t-SNE. The aforementioned data mining techniques are then cross-validated to test the accuracy, precision and recall performance of the proposed tool. Results show that the proposed tool effectively extracts the keywords for each cluster, with RF, DT and bagging achieving optimal accuracies of 98.267%, 97.633% and 97.833%, respectively.
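Filtering similar articles by title, as the tool does before clustering, can be approximated with bag-of-words cosine similarity; a minimal sketch (the paper's tool works on full TF-IDF vectors, not raw counts):

```python
import math
from collections import Counter

def title_similarity(a, b):
    """Cosine similarity of bag-of-words title vectors: the dot product
    of term counts over the product of their Euclidean norms."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```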
COVID-19 vaccination classification of opinion mining with semantic knowledge...
The Covid-19 ontology classifies data, which has been preprocessed, using a supervised learning approach in machine learning. After the classification is done, with the help of opinion mining and decision making, the classified data is stored in the database as a semantic web ontology using the Protégé tool. The data is then retrieved through SPARQL, which supports complex queries, followed by the output for the given query. This Covid-19 ontology helps in analyzing the risk factors and treatment plans for the respective individuals, i.e., students, based on their given details, which include diagnosis, symptoms, and vaccination history. The information given by the students can be processed automatically, and with the help of SWRL (Semantic Web Rule Language), the risk factor and treatment plan for each student are inferred from the given knowledge.
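A SPARQL query against such an ontology might look as follows; the namespace and property names are hypothetical, since the text does not give the ontology's actual vocabulary:

```sparql
# Hypothetical IRIs: the real ontology's namespace and properties
# are not stated in the abstract above.
PREFIX cv: <http://example.org/covid19-ontology#>
SELECT ?student ?risk ?plan
WHERE {
  ?student a cv:Student ;
           cv:hasRiskFactor ?risk ;
           cv:hasTreatmentPlan ?plan .
  FILTER (?risk = cv:HighRisk)
}
```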
The prediction of coronavirus disease 2019 outbreak on Bangladesh perspectiv... (IJECE/IAES)
Coronavirus disease 2019 (COVID-19) has created a severe pandemic situation in many countries of the world, including Bangladesh. If the growth rate of this threat can be forecasted, immediate measures can be taken. This study is an effort to forecast the present pandemic situation using machine learning (ML) forecasting models. Forecasting was done in three categories over the next 30 days. In our study, multiple linear regression performed best among the algorithms in all categories, with R2 scores of 99% for the first two categories and 94% for the third. Ridge regression performed well for the first two categories, with R2 scores of 99% each, but poorly for the third, with an R2 score of 43%. Lasso regression performed reasonably well, with R2 scores of 97%, 99% and 75% for the three categories. We also used Facebook Prophet to predict 30 days beyond our training data, which gave healthy R2 scores of 92% and 83% for the first two categories but performed poorly for the third, with an R2 score of 34%. Finally, all the models' performances were evaluated with a 40-day prediction interval, in which multiple linear regression outperformed the other algorithms.
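The R2 score used throughout these comparisons is the coefficient of determination; a minimal implementation:

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot, where 1.0 is a
    perfect fit and 0.0 matches a constant prediction of the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```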
Covid-19 Prediction in India using Machine Learning (IJTSRD)
Various computational models are used around the world to predict the number of infected individuals and the death rate of the COVID-19 outbreak [3]. Machine learning based models are important for taking proper actions. Due to the ample uncertainty and crucial data, epidemiological models have been challenged regarding higher accuracy for long-term prediction of this disease [1]. By researching the COVID-19 problem, it is observed that lockdown and isolation are important techniques for preventing the spread of COVID-19 [2]. In India, public health and the economic condition are impacted by COVID-19; our goal is to visualize the spread of this disease [5]. Machine learning algorithms are used in various applications for detecting adverse risk factors. We use three ML algorithms: Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest Classifier (RFC). These machine learning models predict the total number of recovered patients by date for each state in India [8]. Sarfraj Alam, Vipul Kumar, Sweta Singh, Sweta Joshi, and Madhu Kirola, "Covid-19 Prediction in India using Machine Learning", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Special Issue: International Conference on Advances in Engineering, Science and Technology - 2021, May 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42458.pdf Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/42458/covid19-prediction-in-india-using-machine-learning/sarfraj-alam
Monitoring Indonesian online news for COVID-19 event detection using deep le... (IJECE/IAES)
Even though coronavirus disease 2019 (COVID-19) vaccination has been carried out, preparedness for a possible next outbreak wave is still needed given new mutations and virus variants. A near real-time surveillance system is required so that stakeholders, especially the public, can respond in a timely manner. Due to its hierarchical structure, epidemic reporting is usually slow, particularly when passing jurisdictional borders. This condition can leave time gaps in public awareness of new and emerging infectious disease events. Online news is a potential source for COVID-19 monitoring because it reports almost every infectious disease incident globally. However, the news does not report only COVID-19 events, but also various information related to COVID-19 topics such as the economic impact, health tips, and others. We developed a framework for online news monitoring and applied sentence classification to news titles using deep learning to distinguish between COVID-19 event and non-event news. The classification results showed that the fine-tuned bidirectional encoder representations from transformers (BERT) model trained on Bahasa Indonesia achieved the highest performance (accuracy: 95.16%, precision: 94.71%, recall: 94.32%, F1-score: 94.51%). Interestingly, our framework was able to identify news reporting the new COVID strain from the United Kingdom (UK) as event news 13 days before the Indonesian officials closed the border to foreigners.
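The reported precision, recall, and F1 follow directly from confusion counts; a small helper with toy counts, not the paper's confusion matrix:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts: precision is
    tp/(tp+fp), recall is tp/(tp+fn), and F1 is their harmonic mean."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```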
A data mining analysis of COVID-19 cases in states of United States of America (IJECEIAES)
Epidemic diseases can be extremely dangerous with their hazardous influences. They may have negative effects on economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors that are interrelated with the COVID-19 pandemic have been examined using data mining methodologies and approaches. As a result of the analysis, some rules and insights have been discovered, and the performances of the data mining algorithms have been evaluated. According to the analysis results, the JRip algorithmic technique had the highest correct classification rate and the lowest root mean squared error (RMSE). Considering the classification rate and the RMSE measure, JRip can be considered an effective method for understanding factors that are related to coronavirus-caused deaths.
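JRip is Weka's implementation of the RIPPER rule learner. As a much simpler illustration of the same rule-induction idea, the sketch below implements Holte's OneR: pick the single attribute whose value-to-majority-class rule makes the fewest training errors. The records are invented toy data, not the paper's US state dataset.

```python
from collections import Counter, defaultdict

# Toy records (invented): categorical features -> mortality class.
records = [
    ({"density": "high", "mask": "no"}, "high-mortality"),
    ({"density": "high", "mask": "yes"}, "high-mortality"),
    ({"density": "high", "mask": "no"}, "high-mortality"),
    ({"density": "low", "mask": "no"}, "low-mortality"),
    ({"density": "low", "mask": "yes"}, "low-mortality"),
    ({"density": "low", "mask": "yes"}, "low-mortality"),
]

def one_r(records):
    """Pick the attribute whose value->majority-class rule
    makes the fewest training errors (OneR)."""
    best = None
    for attr in records[0][0]:
        table = defaultdict(Counter)
        for features, label in records:
            table[features[attr]][label] += 1
        # Errors = records not covered by the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in table.values())
        rule = {val: c.most_common(1)[0][0] for val, c in table.items()}
        if best is None or errors < best[1]:
            best = (attr, errors, rule)
    return best

attr, errors, rule = one_r(records)
```

RIPPER extends this idea to ordered multi-condition rule sets with pruning, but the evaluation principle (rules judged by classification errors) is the same one the abstract reports for JRip.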
Improving cyberbullying detection through multi-level machine learning (IJECEIAES)
Cyberbullying is a known risk factor for mental health issues, demanding immediate attention. This study aims to detect cyberbullying on social media in alignment with the third sustainable development goal (SDG) for health and well-being. Many previous studies employ single-level classification, but this research introduces a multi-class multi-level (MCML) algorithm for a more detailed approach. The MCML approach incorporates two levels of classification: level one for cyberbullying or not cyberbullying, and level two for classifying cyberbullying by type. This study used a dataset of 47,000 tweets from Twitter with six class labels and employed an 80:20 training and testing data split. By integrating bidirectional encoder representations from transformers (BERT) and MCML at level two, we achieved a remarkable 99% accuracy, surpassing BERT-based single-level classification at 94%. In conclusion, the combination of MCML and BERT offers enhanced cyberbullying classification accuracy, contributing to the broader goal of promoting mental health and well-being.
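The two-level decision flow described above can be sketched with trivial keyword rules standing in for the BERT classifiers (the lexicons and labels are invented for illustration; the paper's classifiers are learned, not rule-based):

```python
# Hypothetical lexicons standing in for trained level-1 and level-2 models.
BULLY_WORDS = {"stupid", "loser", "ugly"}
TYPE_WORDS = {"appearance": {"ugly"}, "ability": {"stupid", "loser"}}

def level_one(text):
    """Level 1: cyberbullying vs. not cyberbullying."""
    return any(w in BULLY_WORDS for w in text.split())

def level_two(text):
    """Level 2: classify the type of cyberbullying."""
    for label, words in TYPE_WORDS.items():
        if any(w in words for w in text.split()):
            return label
    return "other"

def classify(text):
    # The MCML-style cascade: only texts flagged at level 1 reach level 2.
    if not level_one(text):
        return "not_cyberbullying"
    return level_two(text)

result = classify("you are so ugly")
```

The cascade lets the second classifier specialize on the harder type distinction, which is the design motivation the abstract credits for the accuracy gain.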
Gears are among the most essential and commonly utilised power transmission components, needed to operate machines involving different loads and speeds. When the load is increased beyond a certain limit, gear teeth frequently fail. Composite materials, in comparison to metallic gears, offer significantly better mechanical qualities, such as a higher strength-to-weight ratio and increased hardness, and hence a lower risk of failure. Al6063 and SiC were employed to build a metal matrix composite for spur gear production in this work.
Sound plays a crucial part in every element of human life. It is a key component in the development of automated systems in a variety of domains, from personal security to essential monitoring. A few systems are on the market now, but their efficiency is a concern for use in real-world circumstances. Like image classification and feature classification, sound classification can be addressed with machine learning algorithms.
Heaviness has been related to stroke, depression, and cancer, which are among the most serious dangers to human existence. Heart disease, stroke, obesity, and type II diabetes are all disorders that have an impact on our way of life. Using data mining and machine learning approaches to forecast disease based on patient treatment history and health data has been a battle for decades.
Early coronavirus disease detection using internet of things smart system (IJECEIAES)
The internet of things (IoT) is quickly evolving, allowing for the connection of a wide range of smart devices in a variety of applications including industry, military, education, and health. Coronavirus has recently expanded fast across the world, and there are no particular therapies available at this moment. As a result, it is critical to avoid infection and monitor signs like fever and shortness of breath. This research work proposes a smart and robust system that assists patients with influenza symptoms in determining whether or not they are infected with the coronavirus disease (COVID-19). In addition to its diagnostic capabilities, the system aids these patients in obtaining medical care quickly by informing medical authorities via the Blynk IoT platform. Moreover, a global positioning system (GPS) module is used to track patient mobility in order to locate contaminated regions and analyze suspected patients' behaviors. Finally, this idea might be useful in medical institutions, quarantine units, airports, and other relevant fields.
Design and development of a fuzzy explainable expert system for a diagnostic ... (IJECEIAES)
Expert systems have been widely used in medicine to diagnose different diseases. However, these rule-based systems rarely explain why and how their outcomes are reached. The rules leading to those outcomes are also expressed in a machine language and confronted with the familiar problems of coverage and specificity. This fact prevents procuring expert systems with fully human-understandable explanations. Furthermore, early diagnosis involves a high degree of uncertainty and vagueness, which constitutes another challenge to overcome in this study. This paper aims to design and develop a fuzzy explainable expert system for coronavirus disease-2019 (COVID-19) diagnosis that could be incorporated into medical robots. The proposed medical robotic application deduces the likelihood level of contracting COVID-19 from the entered symptoms, the personal information, and the patient's activities. The proposal integrates fuzzy logic to deal with uncertainty and vagueness in diagnosis. Besides, it adopts a hybrid explainable artificial intelligence (XAI) technique to provide different explanation forms. In particular, the textual explanations are generated as rules expressed in a natural language while avoiding coverage and specificity problems. Therefore, the proposal could help overwhelmed hospitals during the epidemic propagation and avoid contamination using a solution with a high level of explicability.
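The fuzzy-logic step above handles vagueness by mapping crisp inputs through membership functions and combining rule strengths. A minimal sketch, with invented membership thresholds and rules (not the system's actual knowledge base):

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b, zero outside [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical fuzzy sets for body temperature in degrees Celsius.
def temp_normal(t): return tri(t, 35.5, 36.8, 37.6)
def temp_fever(t):  return tri(t, 37.0, 38.5, 41.0)

def covid_likelihood(temp, dry_cough):
    """Mamdani-style inference: min for rule strength, compare aggregates.
    Rule 1: fever AND dry cough -> high likelihood.
    Rule 2: normal temperature  -> low likelihood."""
    high = min(temp_fever(temp), 1.0 if dry_cough else 0.0)
    low = temp_normal(temp)
    return "high" if high > low else "low"

verdict = covid_likelihood(38.4, dry_cough=True)
```

Because each rule's firing strength is explicit, the same structure can emit the rule itself as a natural-language explanation, which is the explainability angle the abstract describes.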
ARTIFICIAL NEURAL NETWORKS FOR MEDICAL DIAGNOSIS: A REVIEW OF RECENT TRENDS (IJCSES Journal)
Artificial intelligence systems (especially computer-aided diagnosis and artificial neural networks) are increasingly finding many uses in medical diagnosis applications in recent times. These methods are adaptive learning algorithms that are capable of handling multiple and heterogeneous types of clinical data with a view to integrating them into categorized outputs. In this study, we briefly review and discuss the concept, capabilities, and applicability of artificial neural network techniques in medical diagnosis, through consideration of some selected physical and mental diseases. The study focuses on scholarly research within the years 2010 to 2019. Findings show that no electronic online clinical database exists in Nigeria and the Sub-Saharan countries; most reviews in this area focused mainly on physical diseases without considering mental illnesses; the application of ANN in mental and comorbid disorders has not been thoroughly studied; ANN models and algorithms consider mainly homogeneous input data sources rather than heterogeneous ones; and ANN models for multi-objective output systems are few compared to single-output ANN models.
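The "categorized outputs" idea above is just a network mapping a clinical feature vector to a diagnosis score. A minimal one-hidden-layer forward pass in numpy (weights are random and purely illustrative; a real diagnostic ANN would be trained on labeled clinical data):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer: 4 clinical inputs -> 3 hidden units -> 1 diagnosis score.
W1 = rng.normal(size=(3, 4)); b1 = np.zeros(3)
W2 = rng.normal(size=(1, 3)); b2 = np.zeros(1)

def forward(x):
    h = relu(W1 @ x + b1)           # hidden representation of the inputs
    return sigmoid(W2 @ h + b2)[0]  # probability-like output in (0, 1)

score = forward(np.array([0.2, 1.4, 0.0, 0.7]))
```

Heterogeneous inputs (labs, imaging features, questionnaire scores) only change how the input vector is assembled; the forward computation is the same.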
Current issues - International Journal of Computer Science and Engineering Su... (IJCSES Journal)
International Journal of Computer Science and Engineering Survey (IJCSES) is devoted to fields of Computer Science and Engineering surveys, tutorials and overviews. The IJCSES is a peer-reviewed, open access scientific journal published in electronic form as well as print form. The journal will publish research surveys, tutorials and expository overviews in computer science and engineering. Articles from supplementary fields are welcome, as long as they are relevant to computer science and engineering.
Bibliometric analysis highlighting the role of women in addressing climate ch... (IJECEIAES)
Fossil fuel consumption increased quickly, contributing to climate change that is evident in unusual flooding, droughts, and global warming. Over the past ten years, women's involvement in society has grown dramatically, and they have played a noticeable role in reducing climate change. A bibliometric analysis of data from the last ten years has been carried out to examine the role of women in addressing climate change. The analysis's findings discussed the relevance to the sustainable development goals (SDGs), particularly SDG 7 and SDG 13. The results considered contributions made by women in various sectors while taking geographic dispersion into account. The bibliometric analysis delves into topics including women's leadership in environmental groups, their involvement in policymaking, their contributions to sustainable development projects, and the influence of gender diversity on attempts to mitigate climate change. This study's results highlight how women have influenced policies and actions related to climate change, point out areas of research deficiency, and offer recommendations on how to increase the role of women in addressing climate change and achieving sustainability. To achieve more successful results, this initiative aims to highlight the significance of gender equality and encourage inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter... (IJECEIAES)
The active and reactive load changes have a significant impact on voltage and frequency. In this paper, in order to stabilize the microgrid (MG) against load variations in islanding mode, the active and reactive power of all distributed generators (DGs), including energy storage (battery), diesel generator, and micro-turbine, are controlled. The micro-turbine generator is connected to the MG through a three-phase to three-phase matrix converter, and the droop control method is applied for controlling the voltage and frequency of the MG. In addition, a method is introduced for voltage and frequency control of micro-turbines in the transition from grid-connected mode to islanding mode. A novel switching strategy of the matrix converter is used for converting the high-frequency output voltage of the micro-turbine to the grid-side frequency of the utility system. Moreover, using the switching strategy, low-order harmonics in the output current and voltage are not produced, and consequently, the size of the output filter can be reduced. In fact, the suggested control strategy is load-independent and has no frequency conversion restrictions. The proposed approach for voltage and frequency regulation demonstrates exceptional performance and favorable response across various load alteration scenarios. The suggested strategy is examined in several scenarios in the MG test systems, and the simulation results are discussed.
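The droop control mentioned above conventionally ties frequency to active power and voltage to reactive power: f = f0 - kp(P - P0), V = V0 - kq(Q - Q0). A numeric sketch with illustrative coefficients (not the paper's tuned gains or set points):

```python
# Conventional droop characteristics: frequency sags with active power,
# voltage with reactive power. All constants are illustrative.
F_NOM, V_NOM = 50.0, 400.0   # nominal frequency (Hz) and voltage (V)
KP, KQ = 0.5, 5.0            # droop gains: Hz per p.u. P, V per p.u. Q

def droop(p_pu, q_pu):
    """Steady-state droop set points for a given per-unit loading."""
    f = F_NOM - KP * p_pu
    v = V_NOM - KQ * q_pu
    return f, v

f, v = droop(0.8, 0.4)   # 80% active, 40% reactive loading
```

Each DG sharing the island follows its own droop line, so load is divided in proportion to the gains without any communication between units.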
Similar to Root cause analysis of COVID-19 cases by enhanced text mining process
Analyzing sentiment dynamics from sparse text coronavirus disease-19 vaccina... (IJECEIAES)
Social media platforms enable people to exchange their thoughts, reactions, and emotions regarding all aspects of their lives. Therefore, sentiment analysis using textual data is a widely practiced field. Due to the large textual content available on social media, sentiment analysis is usually considered a text classification task. The high feature dimension is an important issue that needs to be resolved by examining text meaningfully. The proposed study considers a case study of coronavirus (COVID) vaccination to conclude public opinions about prospects for vaccination. A text corpus of tweets published between December 12, 2020, and July 13, 2021 is collected. The proposed model is developed through a phase-by-phase data analysis process, followed by an assessment of important information about the collected tweets on the coronavirus disease (COVID-19) vaccine using two sentiment analyzer methods and probabilistic models for validation and knowledge analysis. The results indicated that public sentiment is more positive than negative. The study also presented statistics of trends in vaccination progress in the top countries from early 2021 to July 2021. The scope of the study is enormous regarding sentiment analysis based on keyword and document modeling. The proposed work offers an effective mechanism for a decision-making system to understand public opinion and accordingly assists policymakers in health measures and vaccination campaigns.
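The simplest form of the sentiment-analyzer step above is lexicon counting: score a tweet by the balance of positive and negative words. A toy stand-in for the two analyzers used in the study, with invented word lists:

```python
# Minimal lexicon-based polarity scorer (word lists are invented;
# real analyzers such as VADER use much larger, weighted lexicons).
POSITIVE = {"safe", "effective", "hope", "protected", "good"}
NEGATIVE = {"fear", "side", "risk", "bad", "unsafe"}

def polarity(tweet):
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

labels = [polarity(t) for t in (
    "vaccine is safe and effective",
    "worried about risk and side effects",
)]
```

Aggregating such labels over time gives the positive-versus-negative trend the abstract reports, though learned classifiers handle negation and sparsity far better than raw counts.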
Assessment of the main features of the model of dissemination of information ... (IJECEIAES)
Social networks provide a fairly wide range of data that allows one, in one way or another, to evaluate the effect of the dissemination of information. This article presents the results of a study that describes methods for determining the key parameters of the model needed to analyze and predict the dissemination of information in social networks. An approach based on the analysis of statistical data on user behavior in social networks is proposed. The process of evaluating the main features of the model is described, including the mathematical methods used for data analysis and information dissemination modeling. The study aims to understand the processes of information dissemination in social networks and develop recommendations for the effective use of social networks as a communication and brand promotion tool, as well as to consider the analytical properties of the classical susceptible-infected-removed (SIR) model and evaluate its applicability to the problem of information dissemination. The results of the study can be used to create algorithms and techniques that will effectively manage the process of information dissemination in social networks.
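The classical SIR model referenced above is a pair of coupled rate equations: dS/dt = -βSI, dI/dt = βSI - γI, dR/dt = γI. A minimal forward-Euler integration (parameter values are illustrative, not fitted to any dataset):

```python
def sir(beta, gamma, s0, i0, days, dt=0.1):
    """Forward-Euler integration of the classical SIR equations,
    with S, I, R as fractions of the population:
      dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I."""
    s, i, r = s0, i0, 1.0 - s0 - i0
    for _ in range(int(days / dt)):
        new_inf = beta * s * i * dt   # new "infections" this step
        new_rec = gamma * i * dt      # new "recoveries" this step
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
    return s, i, r

s, i, r = sir(beta=0.4, gamma=0.1, s0=0.99, i0=0.01, days=120)
```

For information spread, "infected" users are those currently sharing a piece of content and "removed" users have lost interest, which is the reinterpretation the article evaluates.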
Involving machine learning techniques in heart disease diagnosis: a performan... (IJECEIAES)
Artificial intelligence is a science that is growing at a tremendous speed every day and has become an essential part of many domains, including the medical domain. Therefore, countless artificial intelligence applications can be seen in the medical domain at various levels, which are employed to enhance early diagnosis and prediction and reduce the risks associated with many diseases, including heart diseases. In this article, machine learning techniques (logistic regression, random forest, artificial neural network, support vector machines, and k-nearest neighbors) are utilized to diagnose heart disease from the Cleveland Clinic dataset obtained from the University of California Irvine machine learning (UCI) repository and the Kaggle platform, and a comparison between the performance of these techniques is made. In addition, some literature related to machine learning and deep learning techniques that aim to provide reasonable solutions in monitoring, detecting, diagnosing, and predicting heart disease, and how these technologies assist in making health decisions, is reviewed. Ten studies published between 2017 and 2022 are selected and summarized by the authors. After executing a series of tests, it is seen that the best performance in diagnosing heart disease is achieved by the support vector machines, with a diagnostic accuracy of 96%. This article concludes that these techniques play a significant and influential role in assisting physicians and health care workers in analyzing heart patients' data, making health decisions, and saving patients' lives.
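Of the five techniques compared above, k-nearest neighbors is the most direct to show: classify a patient by majority vote among the most similar known cases. A minimal sketch on invented two-feature records (not the Cleveland dataset):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance). `train` is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy records (invented): e.g. (normalized cholesterol, normalized
# max heart rate) -> diagnosis label.
train = [
    ((0.9, 0.2), "disease"), ((0.8, 0.3), "disease"), ((0.85, 0.25), "disease"),
    ((0.2, 0.9), "healthy"), ((0.3, 0.8), "healthy"), ((0.25, 0.85), "healthy"),
]
pred = knn_predict(train, (0.82, 0.28))
```

Because k-NN relies on raw distances, feature normalization (as hinted by the toy scaling) matters far more here than for the tree-based methods in the comparison.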
A comprehensive study on disease risk predictions in machine learning (IJECEIAES)
Over recent years, multiple disease risk prediction models have been developed. These models use various patient characteristics to estimate the probability of outcomes over a certain period of time and hold the potential to improve decision making and individualize care. Discovering hidden patterns and interactions from medical databases has become crucial to the growing evaluation of disease prediction models; traditional clinical findings require many trials, which can complicate disease prediction. A comprehensive study on different strategies used to predict disease is presented in this paper. Applying these techniques to healthcare data has improved risk prediction models that identify the patients who would benefit from disease management programs, reducing hospital readmission and healthcare cost, but the results of these endeavors have been mixed.
Classification of a COVID-19 dataset by using labels created from clustering ... (nooriasukmaningtyas)
Novel coronavirus (COVID-19) is a newly discovered infectious disease that has received much attention in the literature because of its rapid spread and the daily global deaths attributable to it. The White House, together with a coalition of leading research groups, has published the freely available COVID-19 Open Research Dataset to help the global research community apply recent advances in natural language processing and other AI techniques in generating novel insights that can support the ongoing fight against this disease. In this paper, the hierarchical and k-means clustering techniques are used to create a tool for identifying similar articles on COVID-19 and filtering them based on their titles. These articles are classified by applying three data mining techniques, namely, random forest (RF), decision tree (DT) and bagging. By using this tool, specialists can limit the number of articles they need to study and pre-process these articles via data framing, tokenisation, normalisation and term frequency-inverse document frequency. The dimensionality of this dataset is reduced to two dimensions by applying t-SNE. The aforementioned data mining techniques are then cross-validated to test the accuracy, precision and recall performance of the proposed tool. Results show that the proposed tool effectively extracts the keywords for each cluster, with RF, DT and bagging achieving optimal accuracies of 98.267%, 97.633% and 97.833%, respectively.
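The k-means step above alternates two operations: assign each point to its nearest centroid, then move each centroid to the mean of its cluster (Lloyd's algorithm). A minimal numpy sketch on toy 2-D blobs standing in for vectorised article titles (the data and the deterministic initialisation are for illustration only):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd's algorithm: assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    # Simple deterministic init for this sketch: evenly spaced data points.
    centroids = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy blobs standing in for article-title vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

The cluster labels produced this way are exactly the "labels created from clustering" that the paper then feeds to the RF, DT and bagging classifiers.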
COVID-19 VACCINATION CLASSIFICATION OF OPINION MINING WITH SEMANTIC KNOWLEDGE... (dannyijwest)
The Covid-19 ontology is used to classify the preprocessed data using a supervised learning approach in machine learning. After the classification is done, with the help of opinion mining with decision-making, the classified data is stored in the database as a semantic web ontology using the Protégé tool. The data will be retrieved through SPARQL, which helps to retrieve complex queries, followed by the output based on the given query. This Covid-19 ontology helps in analyzing the risk factors and treatment plans for the respective individuals, i.e., students, based on their given details, which include diagnosis, symptoms, and vaccination history. The information given by the students can be automatically processed, and with the help of SWRL (Semantic Web Rule Language), the risk factors and treatment plans for the students are inferred from the given knowledge.
The prediction of coronavirus disease 2019 outbreak on Bangladesh perspectiv... (IJECEIAES)
Coronavirus disease 2019 (COVID-19) has made a huge pandemic situation in many countries of the world, including Bangladesh. If the increase rate of this threat can be forecasted, immediate measures can be taken. This study is an effort to forecast the threat of the present pandemic situation using machine learning (ML) forecasting models. Forecasting was done in three categories over the next 30 days. In our study, multiple linear regression performed best among the algorithms in all categories, with an R2 score of 99% for the first two categories and 94% for the third category. Ridge regression performed well for the first two categories with R2 scores of 99% each but performed poorly for the third category with an R2 score of 43%. Lasso regression performed reasonably well with R2 scores of 97%, 99% and 75% for the three categories. We also used Facebook Prophet to predict 30 days beyond our training data, which gave us healthy R2 scores of 92% and 83% for the first two categories but performed poorly for the third category with an R2 score of 34%. Also, all the models' performances were evaluated with a 40-day prediction interval, in which multiple linear regression outperformed the other algorithms.
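The R2 score used throughout the comparison above is 1 - SS_res / SS_tot. A minimal least-squares fit of a linear trend to a toy cumulative-case series, with the R2 computed explicitly (the numbers are synthetic, not the Bangladesh data):

```python
import numpy as np

# Toy cumulative-case series: linear trend plus small noise (synthetic).
t = np.arange(10, dtype=float)   # day index
cases = 100.0 + 25.0 * t + np.array([1, -2, 0, 3, -1, 2, 0, -3, 1, -1], float)

# Fit cases = a*t + b by least squares.
A = np.vstack([t, np.ones_like(t)]).T
(a, b), *_ = np.linalg.lstsq(A, cases, rcond=None)

pred = a * t + b
ss_res = np.sum((cases - pred) ** 2)           # residual sum of squares
ss_tot = np.sum((cases - cases.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination
```

On a near-linear series like this, R2 is close to 1; the abstract's low third-category scores indicate a series whose shape the linear models could not capture.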
and voltage are not produced, and consequently, the size of the output filter
would be reduced. In fact, the suggested control strategy is load-independent
and has no frequency conversion restrictions. The proposed approach for
voltage and frequency regulation demonstrates exceptional performance and
favorable response across various load alteration scenarios. The suggested
strategy is examined in several scenarios in the MG test systems, and the
simulation results are addressed.
Enhancing battery system identification: nonlinear autoregressive modeling fo...IJECEIAES
Precisely characterizing Li-ion batteries is essential for optimizing their
performance, enhancing safety, and prolonging their lifespan across various
applications, such as electric vehicles and renewable energy systems. This
article introduces an innovative nonlinear methodology for system
identification of a Li-ion battery, employing a nonlinear autoregressive with
exogenous inputs (NARX) model. The proposed approach integrates the
benefits of nonlinear modeling with the adaptability of the NARX structure,
facilitating a more comprehensive representation of the intricate
electrochemical processes within the battery. Experimental data collected
from a Li-ion battery operating under diverse scenarios are employed to
validate the effectiveness of the proposed methodology. The identified
NARX model exhibits superior accuracy in predicting the battery's behavior
compared to traditional linear models. This study underscores the
importance of accounting for nonlinearities in battery modeling, providing
insights into the intricate relationships between state-of-charge, voltage, and
current under dynamic conditions.
Smart grid deployment: from a bibliometric analysis to a surveyIJECEIAES
Smart grids are one of the last decades' innovations in electrical energy.
They bring relevant advantages compared to the traditional grid and
significant interest from the research community. Assessing the field's
evolution is essential to propose guidelines for facing new and future smart
grid challenges. In addition, knowing the main technologies involved in the
deployment of smart grids (SGs) is important to highlight possible
shortcomings that can be mitigated by developing new tools. This paper
contributes to the research trends mentioned above by focusing on two
objectives. First, a bibliometric analysis is presented to give an overview of
the current research level about smart grid deployment. Second, a survey of
the main technological approaches used for smart grid implementation and
their contributions are highlighted. To that effect, we searched the Web of
Science (WoS), and the Scopus databases. We obtained 5,663 documents
from WoS and 7,215 from Scopus on smart grid implementation or
deployment. With the extraction limitation in the Scopus database, 5,872 of
the 7,215 documents were extracted using a multi-step process. These two
datasets have been analyzed using a bibliometric tool called bibliometrix.
The main outputs are presented with some recommendations for future
research.
Use of analytical hierarchy process for selecting and prioritizing islanding ...IJECEIAES
One of the problems that are associated to power systems is islanding
condition, which must be rapidly and properly detected to prevent any
negative consequences on the system's protection, stability, and security.
This paper offers a thorough overview of several islanding detection
strategies, which are divided into two categories: classic approaches,
including local and remote approaches, and modern techniques, including
techniques based on signal processing and computational intelligence.
Additionally, each approach is compared and assessed based on several
factors, including implementation costs, non-detected zones, declining
power quality, and response times using the analytical hierarchy process
(AHP). The multi-criteria decision-making analysis shows that the overall
weight of passive methods (24.7%), active methods (7.8%), hybrid methods
(5.6%), remote methods (14.5%), signal processing-based methods (26.6%),
and computational intelligent-based methods (20.8%) based on the
comparison of all criteria together. Thus, it can be seen from the total weight
that hybrid approaches are the least suitable to be chosen, while signal
processing-based methods are the most appropriate islanding detection
method to be selected and implemented in power system with respect to the
aforementioned factors. Using Expert Choice software, the proposed
hierarchy model is studied and examined.
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...IJECEIAES
The power generated by photovoltaic (PV) systems is influenced by
environmental factors. This variability hampers the control and utilization of
solar cells' peak output. In this study, a single-stage grid-connected PV
system is designed to enhance power quality. Our approach employs fuzzy
logic in the direct power control (DPC) of a three-phase voltage source
inverter (VSI), enabling seamless integration of the PV connected to the
grid. Additionally, a fuzzy logic-based maximum power point tracking
(MPPT) controller is adopted, which outperforms traditional methods like
incremental conductance (INC) in enhancing solar cell efficiency and
minimizing the response time. Moreover, the inverter's real-time active and
reactive power is directly managed to achieve a unity power factor (UPF).
The system's performance is assessed through MATLAB/Simulink
implementation, showing marked improvement over conventional methods,
particularly in steady-state and varying weather conditions. For solar
irradiances of 500 and 1,000 W/m2
, the results show that the proposed
method reduces the total harmonic distortion (THD) of the injected current
to the grid by approximately 46% and 38% compared to conventional
methods, respectively. Furthermore, we compare the simulation results with
IEEE standards to evaluate the system's grid compatibility.
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...IJECEIAES
Photovoltaic systems have emerged as a promising energy resource that
caters to the future needs of society, owing to their renewable, inexhaustible,
and cost-free nature. The power output of these systems relies on solar cell
radiation and temperature. In order to mitigate the dependence on
atmospheric conditions and enhance power tracking, a conventional
approach has been improved by integrating various methods. To optimize
the generation of electricity from solar systems, the maximum power point
tracking (MPPT) technique is employed. To overcome limitations such as
steady-state voltage oscillations and improve transient response, two
traditional MPPT methods, namely fuzzy logic controller (FLC) and perturb
and observe (P&O), have been modified. This research paper aims to
simulate and validate the step size of the proposed modified P&O and FLC
techniques within the MPPT algorithm using MATLAB/Simulink for
efficient power tracking in photovoltaic systems.
Adaptive synchronous sliding control for a robot manipulator based on neural ...IJECEIAES
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for
robot hands is always an attractive topic in the research community. This is a
challenging problem because robot manipulators are complex nonlinear systems
and are often subject to fluctuations in loads and external disturbances. This
article proposes an adaptive synchronous sliding control scheme to improve trajectory tracking performance for a robot manipulator. The proposed controller
ensures that the positions of the joints track the desired trajectory, synchronize
the errors, and significantly reduces chattering. First, the synchronous tracking
errors and synchronous sliding surfaces are presented. Second, the synchronous
tracking error dynamics are determined. Third, a robust adaptive control law is
designed,the unknown components of the model are estimated online by the neural network, and the parameters of the switching elements are selected by fuzzy
logic. The built algorithm ensures that the tracking and approximation errors
are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results.
Simulation and experimental results show that the proposed controller is effective with small synchronous tracking errors, and the chattering phenomenon is
significantly reduced.
Remote field-programmable gate array laboratory for signal acquisition and de...IJECEIAES
A remote laboratory utilizing field-programmable gate array (FPGA) technologies enhances students’ learning experience anywhere and anytime in embedded system design. Existing remote laboratories prioritize hardware access and visual feedback for observing board behavior after programming, neglecting comprehensive debugging tools to resolve errors that require internal signal acquisition. This paper proposes a novel remote embeddedsystem design approach targeting FPGA technologies that are fully interactive via a web-based platform. Our solution provides FPGA board access and debugging capabilities beyond the visual feedback provided by existing remote laboratories. We implemented a lab module that allows users to seamlessly incorporate into their FPGA design. The module minimizes hardware resource utilization while enabling the acquisition of a large number of data samples from the signal during the experiments by adaptively compressing the signal prior to data transmission. The results demonstrate an average compression ratio of 2.90 across three benchmark signals, indicating efficient signal acquisition and effective debugging and analysis. This method allows users to acquire more data samples than conventional methods. The proposed lab allows students to remotely test and debug their designs, bridging the gap between theory and practice in embedded system design.
Detecting and resolving feature envy through automated machine learning and m...IJECEIAES
Efficiently identifying and resolving code smells enhances software project quality. This paper presents a novel solution, utilizing automated machine learning (AutoML) techniques, to detect code smells and apply move method refactoring. By evaluating code metrics before and after refactoring, we assessed its impact on coupling, complexity, and cohesion. Key contributions of this research include a unique dataset for code smell classification and the development of models using AutoGluon for optimal performance. Furthermore, the study identifies the top 20 influential features in classifying feature envy, a well-known code smell, stemming from excessive reliance on external classes. We also explored how move method refactoring addresses feature envy, revealing reduced coupling and complexity, and improved cohesion, ultimately enhancing code quality. In summary, this research offers an empirical, data-driven approach, integrating AutoML and move method refactoring to optimize software project quality. Insights gained shed light on the benefits of refactoring on code quality and the significance of specific features in detecting feature envy. Future research can expand to explore additional refactoring techniques and a broader range of code metrics, advancing software engineering practices and standards.
Smart monitoring technique for solar cell systems using internet of things ba...IJECEIAES
Rapidly and remotely monitoring and receiving the solar cell systems status parameters, solar irradiance, temperature, and humidity, are critical issues in enhancement their efficiency. Hence, in the present article an improved smart prototype of internet of things (IoT) technique based on embedded system through NodeMCU ESP8266 (ESP-12E) was carried out experimentally. Three different regions at Egypt; Luxor, Cairo, and El-Beheira cities were chosen to study their solar irradiance profile, temperature, and humidity by the proposed IoT system. The monitoring data of solar irradiance, temperature, and humidity were live visualized directly by Ubidots through hypertext transfer protocol (HTTP) protocol. The measured solar power radiation in Luxor, Cairo, and El-Beheira ranged between 216-1000, 245-958, and 187-692 W/m 2 respectively during the solar day. The accuracy and rapidity of obtaining monitoring results using the proposed IoT system made it a strong candidate for application in monitoring solar cell systems. On the other hand, the obtained solar power radiation results of the three considered regions strongly candidate Luxor and Cairo as suitable places to build up a solar cells system station rather than El-Beheira.
An efficient security framework for intrusion detection and prevention in int...IJECEIAES
Over the past few years, the internet of things (IoT) has advanced to connect billions of smart devices to improve quality of life. However, anomalies or malicious intrusions pose several security loopholes, leading to performance degradation and threat to data security in IoT operations. Thereby, IoT security systems must keep an eye on and restrict unwanted events from occurring in the IoT network. Recently, various technical solutions based on machine learning (ML) models have been derived towards identifying and restricting unwanted events in IoT. However, most ML-based approaches are prone to miss-classification due to inappropriate feature selection. Additionally, most ML approaches applied to intrusion detection and prevention consider supervised learning, which requires a large amount of labeled data to be trained. Consequently, such complex datasets are impossible to source in a large network like IoT. To address this problem, this proposed study introduces an efficient learning mechanism to strengthen the IoT security aspects. The proposed algorithm incorporates supervised and unsupervised approaches to improve the learning models for intrusion detection and mitigation. Compared with the related works, the experimental outcome shows that the model performs well in a benchmark dataset. It accomplishes an improved detection accuracy of approximately 99.21%.
Developing a smart system for infant incubators using the internet of things ...IJECEIAES
This research is developing an incubator system that integrates the internet of things and artificial intelligence to improve care for premature babies. The system workflow starts with sensors that collect data from the incubator. Then, the data is sent in real-time to the internet of things (IoT) broker eclipse mosquito using the message queue telemetry transport (MQTT) protocol version 5.0. After that, the data is stored in a database for analysis using the long short-term memory network (LSTM) method and displayed in a web application using an application programming interface (API) service. Furthermore, the experimental results produce as many as 2,880 rows of data stored in the database. The correlation coefficient between the target attribute and other attributes ranges from 0.23 to 0.48. Next, several experiments were conducted to evaluate the model-predicted value on the test data. The best results are obtained using a two-layer LSTM configuration model, each with 60 neurons and a lookback setting 6. This model produces an R 2 value of 0.934, with a root mean square error (RMSE) value of 0.015 and a mean absolute error (MAE) of 0.008. In addition, the R 2 value was also evaluated for each attribute used as input, with a result of values between 0.590 and 0.845.
A review on internet of things-based stingless bee's honey production with im...IJECEIAES
Honey is produced exclusively by honeybees and stingless bees which both are well adapted to tropical and subtropical regions such as Malaysia. Stingless bees are known for producing small amounts of honey and are known for having a unique flavor profile. Problem identified that many stingless bees collapsed due to weather, temperature and environment. It is critical to understand the relationship between the production of stingless bee honey and environmental conditions to improve honey production. Thus, this paper presents a review on stingless bee's honey production and prediction modeling. About 54 previous research has been analyzed and compared in identifying the research gaps. A framework on modeling the prediction of stingless bee honey is derived. The result presents the comparison and analysis on the internet of things (IoT) monitoring systems, honey production estimation, convolution neural networks (CNNs), and automatic identification methods on bee species. It is identified based on image detection method the top best three efficiency presents CNN is at 98.67%, densely connected convolutional networks with YOLO v3 is 97.7%, and DenseNet201 convolutional networks 99.81%. This study is significant to assist the researcher in developing a model for predicting stingless honey produced by bee's output, which is important for a stable economy and food security.
A trust based secure access control using authentication mechanism for intero...IJECEIAES
The internet of things (IoT) is a revolutionary innovation in many aspects of our society including interactions, financial activity, and global security such as the military and battlefield internet. Due to the limited energy and processing capacity of network devices, security, energy consumption, compatibility, and device heterogeneity are the long-term IoT problems. As a result, energy and security are critical for data transmission across edge and IoT networks. Existing IoT interoperability techniques need more computation time, have unreliable authentication mechanisms that break easily, lose data easily, and have low confidentiality. In this paper, a key agreement protocol-based authentication mechanism for IoT devices is offered as a solution to this issue. This system makes use of information exchange, which must be secured to prevent access by unauthorized users. Using a compact contiki/cooja simulator, the performance and design of the suggested framework are validated. The simulation findings are evaluated based on detection of malicious nodes after 60 minutes of simulation. The suggested trust method, which is based on privacy access control, reduced packet loss ratio to 0.32%, consumed 0.39% power, and had the greatest average residual energy of 0.99 mJoules at 10 nodes.
Fuzzy linear programming with the intuitionistic polygonal fuzzy numbersIJECEIAES
In real world applications, data are subject to ambiguity due to several factors; fuzzy sets and fuzzy numbers propose a great tool to model such ambiguity. In case of hesitation, the complement of a membership value in fuzzy numbers can be different from the non-membership value, in which case we can model using intuitionistic fuzzy numbers as they provide flexibility by defining both a membership and a non-membership functions. In this article, we consider the intuitionistic fuzzy linear programming problem with intuitionistic polygonal fuzzy numbers, which is a generalization of the previous polygonal fuzzy numbers found in the literature. We present a modification of the simplex method that can be used to solve any general intuitionistic fuzzy linear programming problem after approximating the problem by an intuitionistic polygonal fuzzy number with n edges. This method is given in a simple tableau formulation, and then applied on numerical examples for clarity.
The performance of artificial intelligence in prostate magnetic resonance im...IJECEIAES
Prostate cancer is the predominant form of cancer observed in men worldwide. The application of magnetic resonance imaging (MRI) as a guidance tool for conducting biopsies has been established as a reliable and well-established approach in the diagnosis of prostate cancer. The diagnostic performance of MRI-guided prostate cancer diagnosis exhibits significant heterogeneity due to the intricate and multi-step nature of the diagnostic pathway. The development of artificial intelligence (AI) models, specifically through the utilization of machine learning techniques such as deep learning, is assuming an increasingly significant role in the field of radiology. In the realm of prostate MRI, a considerable body of literature has been dedicated to the development of various AI algorithms. These algorithms have been specifically designed for tasks such as prostate segmentation, lesion identification, and classification. The overarching objective of these endeavors is to enhance diagnostic performance and foster greater agreement among different observers within MRI scans for the prostate. This review article aims to provide a concise overview of the application of AI in the field of radiology, with a specific focus on its utilization in prostate MRI.
Seizure stage detection of epileptic seizure using convolutional neural networksIJECEIAES
According to the World Health Organization (WHO), seventy million individuals worldwide suffer from epilepsy, a neurological disorder. While electroencephalography (EEG) is crucial for diagnosing epilepsy and monitoring the brain activity of epilepsy patients, it requires a specialist to examine all EEG recordings to find epileptic behavior. This procedure needs an experienced doctor, and a precise epilepsy diagnosis is crucial for appropriate treatment. To identify epileptic seizures, this study employed a convolutional neural network (CNN) based on raw scalp EEG signals to discriminate between preictal, ictal, postictal, and interictal segments. The possibility of these characteristics is explored by examining how well timedomain signals work in the detection of epileptic signals using intracranial Freiburg Hospital (FH), scalp Children's Hospital Boston-Massachusetts Institute of Technology (CHB-MIT) databases, and Temple University Hospital (TUH) EEG. To test the viability of this approach, two types of experiments were carried out. Firstly, binary class classification (preictal, ictal, postictal each versus interictal) and four-class classification (interictal versus preictal versus ictal versus postictal). The average accuracy for stage detection using CHB-MIT database was 84.4%, while the Freiburg database's time-domain signals had an accuracy of 79.7% and the highest accuracy of 94.02% for classification in the TUH EEG database when comparing interictal stage to preictal stage.
Analysis of driving style using self-organizing maps to analyze driver behaviorIJECEIAES
Modern life is strongly associated with the use of cars, but the increase in acceleration speeds and their maneuverability leads to a dangerous driving style for some drivers. In these conditions, the development of a method that allows you to track the behavior of the driver is relevant. The article provides an overview of existing methods and models for assessing the functioning of motor vehicles and driver behavior. Based on this, a combined algorithm for recognizing driving style is proposed. To do this, a set of input data was formed, including 20 descriptive features: About the environment, the driver's behavior and the characteristics of the functioning of the car, collected using OBD II. The generated data set is sent to the Kohonen network, where clustering is performed according to driving style and degree of danger. Getting the driving characteristics into a particular cluster allows you to switch to the private indicators of an individual driver and considering individual driving characteristics. The application of the method allows you to identify potentially dangerous driving styles that can prevent accidents.
Hyperspectral object classification using hybrid spectral-spatial fusion and ...IJECEIAES
Because of its spectral-spatial and temporal resolution of greater areas, hyperspectral imaging (HSI) has found widespread application in the field of object classification. The HSI is typically used to accurately determine an object's physical characteristics as well as to locate related objects with appropriate spectral fingerprints. As a result, the HSI has been extensively applied to object identification in several fields, including surveillance, agricultural monitoring, environmental research, and precision agriculture. However, because of their enormous size, objects require a lot of time to classify; for this reason, both spectral and spatial feature fusion have been completed. The existing classification strategy leads to increased misclassification, and the feature fusion method is unable to preserve semantic object inherent features; This study addresses the research difficulties by introducing a hybrid spectral-spatial fusion (HSSF) technique to minimize feature size while maintaining object intrinsic qualities; Lastly, a soft-margins kernel is proposed for multi-layer deep support vector machine (MLDSVM) to reduce misclassification. The standard Indian pines dataset is used for the experiment, and the outcome demonstrates that the HSSF-MLDSVM model performs substantially better in terms of accuracy and Kappa coefficient.
Vaccine management system project report documentation..pdfKamal Acharya
The Division of Vaccine and Immunization is facing increasing difficulty monitoring vaccines and other commodities distribution once they have been distributed from the national stores. With the introduction of new vaccines, more challenges have been anticipated with this additions posing serious threat to the already over strained vaccine supply chain system in Kenya.
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSEDuvanRamosGarzon1
AIRCRAFT GENERAL
The Single Aisle is the most advanced family aircraft in service today, with fly-by-wire flight controls.
The A318, A319, A320 and A321 are twin-engine subsonic medium range aircraft.
The family offers a choice of engines
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
Forklift Classes Overview by Intella PartsIntella Parts
Discover the different forklift classes and their specific applications. Learn how to choose the right forklift for your needs to ensure safety, efficiency, and compliance in your operations.
For more technical information, visit our website https://intellaparts.com
Courier management system project report.pdfKamal Acharya
International Journal of Electrical and Computer Engineering (IJECE)
Vol. 12, No. 2, April 2022, pp. 1807-1817
ISSN: 2088-8708, DOI: 10.11591/ijece.v12i2.pp1807-1817
Journal homepage: http://ijece.iaescore.com
Root cause analysis of COVID-19 cases by enhanced text mining process
Sujatha Arun Kokatnoor, Balachandran Krishnan
Department of Computer Science and Engineering, CHRIST (Deemed to be University), Bengaluru, India
Article Info
Article history: Received Mar 17, 2021; Revised Aug 3, 2021; Accepted Sep 1, 2021

ABSTRACT
The main focus of this research is to find the reasons behind the fresh cases of COVID-19 from the public's perception, for data specific to India. The analysis is done using machine learning approaches, and the inferences are validated with medical professionals. The data processing and analysis are accomplished in three steps. First, the dimensionality of the vector space model (VSM) is reduced with an improvised feature engineering (FE) process using a weighted term frequency-inverse document frequency (TF-IDF) and forward scan trigrams (FST), followed by removal of weak features using a feature hashing technique. In the second step, an enhanced K-means clustering algorithm is used for grouping, based on the public posts from Twitter®. In the last step, latent dirichlet allocation (LDA) is applied for discovering the trigram topics relevant to the reasons behind the increase of fresh COVID-19 cases. The enhanced K-means clustering improved the Dunn index value by 18.11% when compared with the traditional K-means method. By incorporating the improvised two-step FE process, the LDA model improved by 14% in terms of coherence score, and by 19% and 15% when compared with latent semantic analysis (LSA) and the hierarchical dirichlet process (HDP) respectively, resulting in 14 root causes for the spike in the disease.
Keywords:
Dunn index
Feature engineering
Feature hashing
Hierarchical dirichlet process
K-means
Latent dirichlet allocation
Latent semantic analysis
This is an open access article under the CC BY-SA license.
Corresponding Author:
Sujatha Arun Kokatnoor
Department of Computer Science and Engineering, School of Engineering and Technology, CHRIST (Deemed to be University), Bengaluru, India
Email: sujatha.ak@christuniversity.in
1. INTRODUCTION
The coronavirus is a family of viruses capable of causing a variety of diseases that are life threatening to humans, including common and more severe forms of cold. The signs and symptoms of the disease may occur within two to 14 days after exposure; this interval between exposure and symptom onset is referred to as the incubation period. The general signs and symptoms include fever, cough, tiredness, breathing difficulty, sore throat, runny nose, headache, and chest pain. Other less common signs include rash, nausea, vomiting, and diarrhea. Some people may have only a few symptoms and some may not have any symptoms at all; these cases are referred to as symptomatic and asymptomatic, respectively [1], [2].
As per the World Health Organization (WHO), data have shown that the virus spreads from person to person among people in close contact (within about 6 feet or 2 meters). The virus spreads through respiratory droplets released when someone coughs, sneezes, or talks. Such droplets may be inhaled or may land in a nearby person's mouth or nose. The virus can also spread when a person touches a contaminated surface and then touches his or her mouth, nose, or eyes, but as per WHO reports this is not a major route of transmission [1]. According to the Centers for Disease Control and Prevention (CDC) [2], a person with the virus is most infectious while showing symptoms (symptomatic), and this is when they are most likely to transmit the virus. However, someone can spread the virus even before showing symptoms of the disease (asymptomatic).
As per WHO data for the India region, there was a spike in new COVID-19 cases from 1907 to 7025 between 18th June and 29th June 2020, rising to 22075 new cases as on 16th July 2020, as shown in Figure 1. There were 2003 deaths observed as on 16th June 2020. To understand the reason behind the increase in new COVID-19 cases during those days, apart from the information available from WHO [1], CDC [2], mohfw.gov.in [3] and mygov.in [4], data was collected from online social media (OSM) as well to understand and analyze the public's opinion on the same. Twitter was chosen as the source of people's opinions, as it is one of the most popular OSMs as per the Data Never Sleeps 7.0 report, with more than 500000 posts per minute.
Figure 1. COVID-19 statistics as per World Health Organization [1]
The collected tweets were pre-processed and improvised feature engineering was applied to create an efficient vector space model (VSM) [5]. The tweets were then clustered using the proposed enhanced K-means clustering algorithm to group the dataset into five different clusters. It was observed manually that the users' discussions on how the disease spread were concentrated in the third cluster, and accordingly the third cluster was considered for topic modeling. Topic modeling was done using the latent dirichlet allocation (LDA) model [6], which identified 41 root causes with the help of trigrams. These 41 causes were communicated to six medical professionals across Karnataka, India, and were thus validated and consolidated into 14 important and prioritized root causes.
There are many clustering algorithms, including partitioning, hierarchical, spectral, density-based and grid-based methods [7]–[9]. K-means is one of the partitioning clustering algorithms. Regular K-means is sensitive to outliers and to the contents of the dataset. Each cluster is represented by the center, or mean, of the data points belonging to it. When text data posted by users in OSM on a particular issue or concern is converted into feature vectors, the majority of the data points tend to lie near the first centroid due to their similar contents. This is a drawback for the K-means clustering algorithm: for the majority of tuples the sum of squared error (SSE) becomes nil for the first center, and only a small number of tuples have high SSE values that place them into other clusters, mainly because the text dataset lacks a variety of topics [10]. Hierarchical clustering has ambiguity in its termination process, involves many arbitrary decisions, and works better with numerical datasets [11]. Density-based clustering is sensitive to the setting of input parameters and has poor cluster descriptors [12]. In grid-based clustering, the boundaries of the clusters are either vertical or horizontal, and diagonal boundaries are not detected; such methods are also computationally expensive for datasets with distinct data points [13]. In this research work, an enhanced K-means clustering was implemented to group the input dataset into five clusters based on the contents of the users' tweets.
Feature engineering was applied in two steps. In the first step, forward scan trigrams (FST)-based
term frequency and inverse document frequency (TF-IDF) was applied to reduce the high dimensional
feature vector into an efficient VSM [5]. In the second step, all the weak features present in the text dataset
were removed using the proposed feature hashing method. This VSM was input to the LDA topic model [14]
for identifying the most relevant trigram topics which were considered as root causes for the spike in new
COVID-19 cases for the dataset considered.
2. RELATED WORK
For anticipation of the disease epidemiological trend and rate of COVID-19 in India, linear
regression (LR), multilayer perceptron (MLP) and the vector auto regression (VAR) models were used [15].
The possible COVID-19 impact trends in India were predicted on the basis of data collected from Kaggle.
The prediction model was based on the cases which were in primitive stages and the Spearman’s correlation
was used to find the similarity between the features present in the dataset [15]. As the features in the dataset considered are non-linear and mutually dependent, Spearman's rank coefficient [16] led to inaccurate forecasting of the spread of the disease.
Several technologies, including blockchain, internet of things (IoT), artificial intelligence (AI), machine learning (ML), 5G and unmanned aerial vehicles (UAV), have been used to reduce the impact of the coronavirus disease outbreak by analyzing the available datasets [17]. Although studies on the pathophysiological properties of COVID-19 exist, its spreading mechanism remains somewhat elusive. In [18], machine learning was used to quantify COVID-19 content concerning the establishment of health guidance, especially vaccines, among online opponents. Users' posts on Facebook were analyzed for both anti-vaccination and pro-vaccination communities. A snowball approach was used for scraping users' posts which discussed vaccines, vaccination policies, or pro- and anti-vaccination arguments for the COVID-19 disease. The LDA algorithm was then used for analyzing the appearance and involvement of topics on COVID-19 [18]. The limitations included the absence of other social media data and of a feature engineering process for dealing with the text dataset.
In [19] situational information from social media data on COVID-19 was identified, analyzed, and
classified using natural language processing techniques into seven types of situational information. They
were cautions and advice, measures taken, donations, emotional support, seeking help, criticizing and rumor
spreading. The dataset was manually labeled, and later support vector machine (SVM), naïve Bayes (NB), and random forest (RF) algorithms [19] were used for the classification. The limitation of this technique is that social media data does not come with labels, and manual labeling is very time consuming and limited by the annotator's domain expertise.
To analyze multivariate time series evolution, hierarchical clustering was used for the COVID-19 pandemic in [20]. Countries were divided into clusters on a daily basis according to their case and death numbers, and the total number of clusters and the membership of individual countries were determined algorithmically. This analysis gave new insights into COVID-19's spread across countries and through time [20]. However, hierarchical clustering seldom provides the best solution, as it involves many arbitrary choices: it does not work with missing data, works poorly with mixed data types, does not scale well to huge datasets, and its main output, the dendrogram, is commonly misinterpreted.
In [21], the defects of the conventional K-means clustering process, namely convergence to local optima and large intra-cluster variance, were addressed by calculating the density of the dataset with a weighted distance density method. Experimental results showed that the intra-cluster variance of the clustering results was reduced compared with the traditional method, improving the performance of the algorithm [22]. However, the weighted distance density method fails to identify the outliers present in the dataset.
A K-means clustering based machine learning method was applied to a dataset covering different regions of China, obtained from the WHO [1]. The temperature attribute was added to the original WHO dataset to demonstrate the effect of temperature on each region from three separate COVID-19 perspectives: suspected, confirmed, and death cases [23]. It was observed that temperature is not the only factor in the spread of the corona disease; there are several other contributing factors, and if these are included as attributes in the data analysis, a better model of avoidance can emerge.
Latent dirichlet allocation [24] was used in [6] to group similar tweets occurring in the same user-to-user communication channel, with cosine similarity used to extract the topmost ten tweets. Grouping by hashtags, however, caused duplication of the tweets and thus a longer training time, reducing the performance of the model.
The important topics posted by the public on Twitter were identified using online LDA in [25]. Twelve different topics were identified and consolidated into four main categories: i) virus origin, ii) virus resources, iii) virus impact on the public, and iv) countries and the economy; the last category also included identification of ways of mitigating the risk of infection. The regular online LDA could not capture uncorrelated topics due to the topic distribution in the tweets collected. Moreover, the number of topics in the dataset was specified by the authors, which is subjective and does not always reflect the true distribution of topics.
In [26], the public messages from Reddit based on coronavirus disease discussion were first
extracted. Then, fifty discussion topics were chosen and the data analysis was done using natural language
processing (NLP) and LDA algorithms [14]. Out of the chosen topics, three topics were found to be more
relevant to the coronavirus discussion, covering public health measures, daily life impact and the sense of pandemic severity. The topics were predicted according to a multinomial distribution, followed by a further multinomial distribution trained on a particular topic. However, this may be unsuitable if the real structure is more complex than a multinomial distribution or if the training data is not sufficient [27].
3. RESEARCH METHOD
The main objective of this research work is to identify the root causes for spike in new coronavirus
cases through identification of mode of virus transmission from one person to another using machine learning
approaches. People have expressed their views and opinions on the onset of the coronavirus in OSM. When the spike in fresh cases was observed, people took to OSM to express, within their domain knowledge, their views on the reasons behind the increase in cases. Those posts were extracted, and the reasons were analyzed and validated with the help of medical professionals. One of the most popular OSMs is Twitter, from which tweets were extracted for the study. The extracted text has imprecise grammar, leading to lexical, syntactic, and semantic ambiguities, which causes inappropriate analysis and identification of patterns.
The text therefore needs to be pre-processed before it can be used for clustering and followed by topic
modeling [6]. Figure 2 shows the architectural model considered for the study. After extracting the tweets
from Twitter, natural language tool kit (NLTK) 3.1 version is used for initial data preprocessing. Then the
first level of improvised feature engineering (weighted TF-IDF in combination with Forward Scan Trigrams
[5]) is applied to create an efficient VSM. This VSM is input to an enhanced K-means clustering algorithm to
yield clusters based on the similarity of the data elements. The cluster consisting of relevant root causes
addressing the coronavirus further undergoes a second level of improvised feature engineering including
removal of weak features by feature hashing method. The outcome of this step is input to the LDA Topic
Model to identify topics and its associated words on the possible root causes identified.
Figure 2. Proposed architecture for identification of root causes for spike in new COVID-19 cases
3.1. Corpus creation
As per the Data Never Sleeps 7.0 report [28], Twitter is one of the most popular OSMs, with more than 500000 messages posted by users per minute. Twint [10] is a Python library that comes with search filters which can be combined with useful methods to retrieve precise data, and scripts can be written with it to carry out more specific and complex actions. The ability to write and scale Twitter searches with Twint enables easy but efficient data extraction from social media, including information about recent Twitter events.
COVID-19 statistics were taken from the WHO official website [1]. It was observed that there was a sudden spike in new cases, as shown in Figure 1, between 18th June 2020 and 29th June 2020, when the Government announced Unlock 1.0 and eased public movement restrictions. Using Twint, 7674 tweets were collected during this period for the India region using keywords such as "spike in COVID-19", "spike in coronavirus", "increase in COVID-19", "increase in new coronavirus cases", "root cause for corona spread" and "recent spike in new corona cases".
3.2. Data pre-processing
This sub-section explains the pre-processing of the text data used in the research work. The Twitter comments were pre-processed before analysis, using NLP tools, in order to achieve better results. For the creation of a standard text dataset, the following process was used. The tweets were converted to lower case in order to reduce the text dataset volume. Punctuation was removed to avoid different forms of the same word, and white spaces were removed as well. Stop words with little contribution to the semantic quality of the document, such as "is", "the", "and", "a" and "an", were deleted. Finally, words with similar semantic characteristics but different forms were reduced to a common root term through stemming; for example, "causes" and "causing" were reduced to the root term "cause".
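The steps above can be sketched in Python. This is a minimal, dependency-free illustration: the stop-word list and the naive suffix-stripping stemmer below are crude stand-ins for NLTK's stopword corpus and Porter stemmer, not the tools actually used in the study.

```python
import re
import string

# Minimal stand-ins for NLTK's stopword corpus and Porter stemmer, used only
# to keep the sketch dependency-free.
STOP_WORDS = {"is", "the", "and", "a", "an", "to", "of", "in"}

def crude_stem(word):
    # Naive suffix stripping; the study's pipeline would use a real stemmer
    # such as nltk.stem.PorterStemmer.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(tweet):
    text = tweet.lower()                                   # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())                # collapse white space
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]
```

With this sketch, "causes" and "causing" are reduced to a shared root, mirroring the paper's stemming example, though the exact root form differs from a full Porter stemmer.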
3.3. Feature engineering
Feature engineering [29], [30] is the domain knowledge approach used to extract features from raw data using data mining techniques; these features can be used to enhance machine learning algorithms. In this proposed work, the tweets extracted from users between 18th June and 29th June 2020 for the India region, on the increase in new coronavirus cases, were converted into an efficient VSM using a two-step feature engineering process [5]. Weighted TF-IDF with the forward scan trigrams approach was used in the first step [5], and weak features were removed using improvised feature hashing in the second step.
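As an illustration of the first step, the sketch below extracts trigrams by a simple forward scan and weights them with plain TF-IDF. The paper's weighting scheme and FST probabilities are detailed in [5], so this is only a simplified stand-in, not the published method.

```python
import math
from collections import Counter

def forward_trigrams(tokens):
    # Forward scan: slide a three-token window left to right over the tweet.
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def trigram_tfidf(docs):
    # docs: list of token lists; returns one {trigram: tf-idf weight} dict
    # per document, forming a sparse vector space model over trigrams.
    tri_docs = [forward_trigrams(d) for d in docs]
    n_docs = len(tri_docs)
    df = Counter(t for d in tri_docs for t in set(d))   # document frequencies
    vectors = []
    for d in tri_docs:
        tf = Counter(d)
        total = len(d) or 1
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors
```

Trigrams shared by every document receive a weight of zero, which is how TF-IDF suppresses uninformative features before clustering.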
3.3.1. Improvised feature hashing
The second step of the feature engineering process was applied to the processed dataset, and its output was fed to the LDA topic model. This step helped in analyzing and using the relationship between the topics and their associated words, so that the topics contained only words relevant to the reasons behind the spike in new COVID-19 cases. The second step involved the removal of weak features. In a text dataset, high-frequency words appear far more often than low-frequency words; the low-frequency words are essentially poor corpus characteristics, making it good practice to remove them. All the weak features were removed from the preprocessed corpus in this stage. This was accomplished by a feature hashing technique in which the terms were mapped to hash function indices; the standard default feature size of 262,144 with the Murmur3 hash function [31] was used in the study. Algorithm 1 illustrates the steps to remove weak features using improvised feature hashing. The VSM generated after removing all the weak features was input to the LDA topic modeling algorithm [6].
Algorithm 1. Weak features removal using improvised feature hashing
//Input: Pre-Processed Text Corpus (Trigrams)
//Output: Corpus without Low Frequency Terms
1: for each trigram i:
       value = murmur3(i)
       calculate frequencies from the mapped indexes
2: arrange every trigram according to its frequency calculated in Step 1
3: label(x) = 1 [x is a trigram]
4: calculate Median = (N+1)/2 [N is the total number of trigrams generated]
5: if frequencycount(x) < Median:
       label(x) = 0
6: remove every x with label value = 0
7: return
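Algorithm 1 can be sketched as follows, with two caveats: an MD5-based index stands in for Murmur3 to stay within the standard library, and steps 4-5 are read as "drop trigrams whose hashed-bucket frequency falls below the median frequency", which is one plausible interpretation of the published pseudocode.

```python
import hashlib
from collections import Counter

HASH_SPACE = 262_144  # default feature-hashing dimensionality used in the paper

def hash_index(trigram):
    # MD5 reduced modulo the feature space; a stand-in for Murmur3 [31].
    digest = hashlib.md5(trigram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % HASH_SPACE

def remove_weak_features(trigrams):
    # Count each trigram through its hash bucket (steps 1-2), then drop the
    # trigrams whose bucket frequency falls below the median (steps 3-6).
    counts = Counter(hash_index(t) for t in trigrams)
    freqs = sorted(counts[hash_index(t)] for t in set(trigrams))
    median = freqs[len(freqs) // 2]
    return [t for t in trigrams if counts[hash_index(t)] >= median]
```

Hashing lets the frequency table stay a fixed size regardless of vocabulary growth, at the cost of rare collisions between distinct trigrams.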
3.4. Enhanced k-means clustering
K-means is an unsupervised machine learning clustering algorithm [8]. Its main task is to divide the input dataset into K clusters based on the similarity of the data elements present in the dataset. The quality of the clusters produced by K-means depends entirely on the selection of the initial clusters. K data elements are selected as initial centroids, and the Euclidean distance measure calculates the distances of all the data elements from the chosen centroids. Each data element is moved to the cluster of its nearest centroid, and the process continues until no further changes in the centroids take place. The conventional K-means algorithm [8], however, is computationally expensive, since the choice of centroids determines the quality of the clusters and the algorithm converges to local minima. Conventional K-means is also sensitive to the original seed values (the likelihood of a favorable random initialization being about k!/k^k, where k is the total number of clusters chosen) as well as to outliers in the dataset.
The initial centroids are systematically determined to produce the clusters more accurately during the first phase, while in the second phase data points are assigned to the appropriate clusters as described in the proposed algorithm. These two phases are iterated for k = √(n/2) times, where n is the total number of feature vectors. In each iteration the Dunn index (DI) [32] is calculated using (1), and the details of the clusters and the DI values are stored in temporary lists. The total number of clusters and their details are chosen for further analysis based on the highest DI value.

DI = MinValue(Inter Cluster Distance) / MaxValue(Intra Cluster Distance)    (1)
Algorithm 2 illustrates the procedure of the enhanced k-means clustering algorithm; calculate_centroids and determine_clusters are shown in Figure 3 and Figure 4, respectively.
Algorithm 2. Enhanced k-means clustering algorithm
//Input: Pre-Processed Text Corpus
//Output: K clusters
1: K = √(n/2)
2: Initial Cluster C = {Random Data Element}
3: for i = 1 to K:
       C1 = Calculate_Centroids(C)
       C2 = Determine_Clusters(C1)
       calculate DI and store the DI value and the corresponding cluster details in a temporary list
4: choose M = maximum DI among the stored values
5: return the clusters and the corresponding data elements for M
Figure 3. Flowchart for calculate_centroids
Figure 4. Flowchart to determine_clusters
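The overall loop can be sketched compactly as below. The flowcharts for calculate_centroids and determine_clusters are collapsed here into a plain Lloyd-style K-means pass (the paper's systematic centroid initialization is not reproduced), so this is an illustrative approximation: cluster for each candidate k up to √(n/2) and keep the result with the highest Dunn index from (1).

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)

def kmeans(points, k, iters=50, seed=0):
    # Plain Lloyd's algorithm: assign points to nearest centroid, recompute.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return clusters, centroids

def dunn_index(clusters):
    # Equation (1): min inter-cluster distance / max intra-cluster distance.
    inter = min((dist(p, q) for i, a in enumerate(clusters)
                 for b in clusters[i + 1:] for p in a for q in b), default=0.0)
    intra = max((dist(p, q) for c in clusters for p in c for q in c), default=0.0)
    return inter / intra if intra else 0.0

def enhanced_kmeans(points):
    # Try k = 2 .. sqrt(n/2) and keep the clustering with the highest DI.
    kmax = max(2, int(math.sqrt(len(points) / 2)))
    return max((kmeans(points, k) for k in range(2, kmax + 1)),
               key=lambda res: dunn_index([c for c in res[0] if c]))
```

Well-separated, compact clusters push the DI up, which is exactly the selection criterion used in step 4 of Algorithm 2.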
3.5. Latent dirichlet allocation algorithm (LDA)
The VSM obtained after the feature engineering process described in section 3.3.1 was fed as input to the LDA topic model [6] for further analysis. LDA is an unsupervised machine learning technique generally used for analyzing text documents to determine the cluster of words associated with a particular topic. In this paper, LDA topic modeling has been used to determine the reasons behind the increase in new COVID-19 cases during the period indicated by the WHO reports, based on the public discussions in OSM.
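For readers unfamiliar with LDA's mechanics, the sketch below implements collapsed Gibbs sampling for LDA in plain Python. Production work would normally use a library such as gensim; the hyperparameters here are illustrative defaults, not the paper's settings.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    # docs: list of token lists. Returns per-topic word counts after sampling.
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    doc_topic = [[0] * n_topics for _ in docs]                # doc-topic counts
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    topic_total = [0] * n_topics
    assign = []
    for di, doc in enumerate(docs):                           # random init
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            doc_topic[di][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1
        assign.append(zs)
    for _ in range(iters):                                    # Gibbs sweeps
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = assign[di][wi]                            # remove current z
                doc_topic[di][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1
                weights = [(doc_topic[di][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + vocab_size * beta)
                           for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[di][wi] = t                            # resample, restore
                doc_topic[di][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1
    return topic_word
```

The most frequent entries in each returned counter are the topic's describing terms; when the tokens are trigrams, these become trigram topics of the kind reported in section 4.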
4. RESULTS AND DISCUSSION
This section provides a performance assessment of the research methodology. The data for the study comprised Indian public discussions on Twitter about the day-to-day increase in COVID-19 cases. The tweets were collected between 18th June and 29th June 2020 using the Python library Twint, as explained in section 3.1, and were further processed using NLP tools and the feature engineering process discussed in sections 3.2 and 3.3.
NLTK 3.1 [33] was used for building the Python programs, as the work involved a human-generated textual dataset. After pre-processing, the dataset was fed to the enhanced K-means clustering algorithm discussed in section 3.4. Initially, regular K-means clustering [8] was applied with K=5 clusters; it was observed that the clusters overlapped and the data points were not distributed accurately, and although K-means was iterated 200 times, no changes were observed in the centroids. The pre-processed dataset was then input to the enhanced K-means algorithm, and it was observed that the cluster centroids were well separated in enhanced K-means clustering compared with the conventional one.
DI [32] was used to validate and identify sets of clusters which were compact, with small variations between cluster members and sufficient distance between the centroids of other clusters. The optimal number of clusters K=5 was chosen for this research work with the maximum Dunn index value (0.2797), as shown in Figure 5; the DI for conventional K-means with K=5 was 0.0986. It was observed from the clustering results obtained through the enhanced K-means algorithm that only one cluster among the 5 contained the reasons for the increase in new cases as posted by the public. The cluster comprising root causes was validated through a statistical measure, namely Cohen's Kappa measure of agreement [34]. Through manual annotation of the 7,674 collected tweets, it was found that 1639 tweets contained reasons for the spike in the disease, whereas through the experimentation process 1,335 data elements containing the root causes were grouped into the 3rd cluster. As the Kappa value was 0.92, there was a strong agreement between the manual identification and the experimental results.
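The agreement statistic can be computed as follows. This is the standard Cohen's Kappa formula, κ = (p_o − p_e) / (1 − p_e), with labels 1/0 marking whether each tweet was judged to contain a root cause; the toy inputs below are illustrative, not the study's annotations.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for chance.
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A value of 1.0 means perfect agreement and 0.0 means agreement no better than chance, so the reported 0.92 indicates near-perfect concordance between the manual labels and the cluster assignment.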
The third cluster, after removal of the weak features through the feature hashing process discussed in section 3.3.1, was input to the LDA model. The number of topics chosen was 10, and the experimentation process resulted in 300 trigrams, with 30 trigrams per topic. The 41 reasons listed in Table 1 were found to be the root causes for the sudden spike in new COVID-19 cases during the specified time period, as per the public's opinion and discussion in OSM. These reasons were communicated to six medical professionals for validation, who consolidated them into the 14 main root causes shown in Table 2.
Figure 5. Validation of clusters with Dunn index
Table 1. LDA consolidated results: 41 root causes for COVID-19 increase
Identified root causes for a sudden spike in New COVID-19 cases
1. More family gathering
2. No proper social distancing
3. Low quality sanitizers
4. Poor quality masks
5. Protests are leading spikes
6. Hospital lock down
7. Late visiting Hospital
8. Less people coordination
9. People mislead others
10. Do not follow home quarantine
11. Do not wear mask while outside
12. Infected-late presentation
13. Spread across community
14. Regular working people
15. People moving freely
16. Do not Care other People
17. Hospital was seized
18. Hospitals are closed
19. Police and health
20. People are scared
21. Avoid hospital checkup
22. Social stigma result
23. Scared social stigma
24. Avoiding barricade during covid
25. High influential people
26. Put proper mask
27. How to wear mask
28. Remove mask while outside
29. Late visiting to hospital
30. Early intervention and self-medication
31. Social workers lack
32. Blame game attacking
33. to work together
34. Less people coordination
35. Mislead other causes spikes
36. Permanent Doctors Nurses
37. Lack of doctors nurses
38. Democrats keep yelling
39. Lying about corona
40. Hospital Lock Down
41. Panic other group
Table 2. Consolidated root causes validated by medical professionals
Consolidated Root Causes for New COVID-19 Cases as Validated by Medical Professionals
1. Family gatherings-**
2. No proper social distance-*****
3. Low quality sanitizers- *
4. Low quality masks-*
5. Not wearing masks appropriately when in public-****
6. Removing masks while speaking-***
7. Hospital lockdowns due to a covid case thereby preventing treatment for rest identified-*
8. Group protests, people are moving freely, do not care about other people, avoiding home quarantine-***
9. Self-intervention, hiding the illness, social stigma, using highly influential people for hiding the disease, late visit to the hospitals,
lying about their illness, avoiding barricades.-***
10. Lack of social workers, nurses, doctors and health facility-*
11. Misleading and panicking other groups of people-**
12. Lack of people coordination while treating people with Covid-*
13. Infections from police and health care sectors-*
14. No regular working people like nurses doctors for serving people who have corona-*
Each * provided by the medical professionals assigned an extra weightage to the consolidated point mentioned, where * denotes the least weightage and ***** the highest. Feature engineering through TF-IDF with forward scan trigrams [5], followed by removal of weak features through feature hashing, improved the model’s performance by 12% in terms of coherence scores. The coherence score was used in the experimentation process for assessing the quality of the identified topics: it outputs a value for each topic by measuring the degree of semantic similarity among the words of that topic. The comparison results are shown in Table 3. Figure 6 gives the comparative analysis of LDA with latent semantic analysis (LSA) and hierarchical dirichlet process (HDP) topic models. LSA captured polysemy only partially, since several distinct senses collapse into a single representation. It worked especially well for long documents because each document was represented by a small number of vectors derived from its contextual meaning. However, the high volume of data lowered the efficiency of LSA, consuming more storage and computation time, and the terms LSA associated with each topic proved difficult to interpret. HDP computed a simple probability over a collection of multinomial distributions, permitting a potentially infinite number of topics; this allowed the number of topics considered to grow or shrink with the data. Because HDP had to settle on an optimal degree of granularity for the number of topics on its own, it performed worse. LDA assumed that each document is relevant to a certain number of topics and is derived from a mixture of probabilistic samples: first the distribution of possible topics for the dataset was considered, and second the list of potential words within a chosen topic. LDA also distinguished cleanly between global and local variables: the global variables are the topics, and the local variables are the document-level indices.
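The two-stage sampling view of LDA described above can be illustrated with a small generative sketch. This is not the paper’s fitted model; the topic names, word distributions, and probabilities below are invented purely for illustration:

```python
import random

def generate_document(topics, topic_probs, n_words, rng):
    """LDA's generative view: for each word position, first draw a topic
    from the document's topic distribution, then draw a word from that
    topic's word distribution."""
    doc = []
    for _ in range(n_words):
        topic = rng.choices(list(topics), weights=topic_probs)[0]
        words, word_probs = zip(*topics[topic].items())
        doc.append(rng.choices(words, weights=word_probs)[0])
    return doc

rng = random.Random(42)
topics = {
    "travel": {"flight": 0.5, "border": 0.3, "quarantine": 0.2},
    "crowd":  {"market": 0.6, "rally": 0.4},
}
print(generate_document(topics, [0.7, 0.3], 5, rng))
```

Inference in LDA runs this process in reverse: given only the documents, it estimates the topic and word distributions most likely to have generated them.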
Table 3. Comparative coherence values for LDA model with feature engineering process
No. of Topics | Feature Engineering 1st Step: TF-IDF+Forward Scan Trigrams | Feature Engineering 1st Step+Feature Hashing
10            | 0.5302                                                     | 0.6158
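The coherence values reported in Table 3 come from a measure that scores each topic by how semantically related its words are. As a minimal stdlib-only illustration of the idea, the sketch below computes the UMass variant, which is based on document co-occurrence counts; the exact measure and corpus used in the experiments may differ:

```python
from math import log

def umass_coherence(topic_words, documents):
    """UMass coherence for one topic: sum of log((D(wi, wj) + 1) / D(wj))
    over ordered word pairs, where D counts documents containing the words.
    Higher (less negative) values indicate a more coherent topic."""
    doc_sets = [set(d.split()) for d in documents]
    def d_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += log((d_count(wi, wj) + 1) / d_count(wj))
    return score

docs = ["covid spread mask crowd", "mask crowd market", "travel covid spread"]
print(round(umass_coherence(["covid", "spread", "mask"], docs), 3))
```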
Figure 6. Comparative analysis of LDA with LSA and HDP topic models with and without feature
engineering process
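The second feature engineering step compared above hashes features into a fixed-size space and then drops weak ones against a median threshold. A minimal stdlib-only sketch of that idea follows; the bucket count, hash function, and term weights are illustrative assumptions, not the configuration used in the experiments:

```python
import hashlib
from statistics import median

def hash_features(term_weights, n_buckets=8):
    """Hashing trick: map each term's TF-IDF weight into a fixed number
    of buckets, accumulating weights on collision."""
    buckets = [0.0] * n_buckets
    for term, w in term_weights.items():
        idx = int(hashlib.md5(term.encode()).hexdigest(), 16) % n_buckets
        buckets[idx] += w
    return buckets

def drop_weak_features(buckets):
    """Remove weak features: keep only buckets whose accumulated weight
    reaches the median of the non-empty buckets (the median is robust
    to outlier weights, unlike a mean-based threshold)."""
    nonzero = [w for w in buckets if w > 0]
    m = median(nonzero)
    return [w for w in buckets if w >= m]

weights = {"covid_case_rise": 0.9, "mask_not_worn": 0.7,
           "market_crowd_gather": 0.6, "stopword_junk": 0.05}
print(drop_weak_features(hash_features(weights)))
```

The fixed bucket count bounds the dimensionality of the vector space regardless of vocabulary size, which is the property the paper exploits to shrink the VSM.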
5. CONCLUSION
The analysis and description of the transmission of coronavirus disease by considering public posts on Twitter®, validated with the aid of medical professionals, was the major focus of this article. The overview provided a better understanding of COVID-19 and its exponential growth in recent days. The analysis of Twitter® posts on the disease produced relevant and useful topics as three-word sequences, which specified the probable root causes of its transmission from one person to another.
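The three-word sequences mentioned above are trigrams; the forward scan variant considers context on both sides of a target word rather than only the preceding words. The stdlib-only sketch below is a simplified illustration of that idea, not the exact formulation, which is defined in [5]:

```python
from collections import Counter

def trigrams(tokens):
    """All consecutive three-word sequences in a token list."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def bidirectional_context(tokens, target):
    """Count words one position before AND after each occurrence of
    `target` -- looking both ways, unlike a classic n-gram model that
    conditions only on preceding words."""
    ctx = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            if i > 0:
                ctx[tokens[i - 1]] += 1
            if i < len(tokens) - 1:
                ctx[tokens[i + 1]] += 1
    return ctx

toks = "people gather market no mask people gather rally".split()
print(trigrams(toks)[0])                      # first trigram
print(bidirectional_context(toks, "gather"))  # neighbours of "gather"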
In this proposed work, a three-step model was proposed. In the first step, the text corpus extracted between 18th June and 29th June 2020, consisting of COVID-19 discussions and their propagation, was preprocessed using NLP tools and a feature engineering process. The calculation of trigrams (forward scan trigrams) was modified by considering not only the probabilities of the words preceding a target word in a given text but also the probabilities of the words following it. This modification helped retain the context and meaning of the sequence, since a user’s post is natural language written without regard to grammar. This modified information was passed on to the weighted TF-IDF model, thereby reducing the dimensionality of the VSM, and was further processed by an additional task of removing weak features based on a process combining feature hashing and a median value. This improved the detection of hidden topics while simultaneously increasing the coherence scores of the model. The general variance model added more weight to outliers because they were far from the calculated mean; the median value was therefore used for the removal of weaker features. This step enhanced the proposed model’s performance by 12% in terms of coherence scores.
In the second step, an enhanced K-means clustering algorithm was applied for calculation of the centroids, thereby grouping the processed dataset into clusters. The cluster validation was done through the Dunn index (DI), and five clusters were chosen based on the highest DI value obtained (0.2797).
In the last step, LDA topic modeling was used for identifying the root causes relevant to the increase in new infected cases of coronavirus. Of the five clusters resulting from the second step, only one cluster was attributed to the root causes; this was validated through Cohen’s kappa coefficient, which resulted in 92%, indicating a strong agreement between the manual identification and the experimental results. This last step yielded 43 root causes, of which 41 were validated by five medical professionals across Karnataka. The other two reasons (‘recent travel history’ and ‘direct contact infected’) were validated against WHO reports. The 41 reasons for the spread of the disease were further consolidated into 14 prioritized root causes by the medical professionals. Future work may include automated identification of root causes not only from textual posts, but also from the emoticons, images and videos posted by the public in OSM.
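The Dunn index used above for cluster validation can be sketched in a few lines: it is the smallest between-cluster distance divided by the largest within-cluster diameter, so higher values indicate tighter, better-separated clusters. A toy 2-D example follows; the experiments clustered high-dimensional text vectors, but the formula is the same:

```python
from math import dist  # Euclidean distance, Python 3.8+

def dunn_index(clusters):
    """Dunn index: min distance between points of different clusters
    divided by the max within-cluster diameter. Higher is better."""
    inter = min(dist(p, q)
                for i, a in enumerate(clusters)
                for b in clusters[i + 1:]
                for p in a for q in b)
    intra = max(dist(p, q) for c in clusters for p in c for q in c)
    return inter / intra

tight = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]   # compact, well separated
loose = [[(0, 0), (0, 3)], [(2, 2), (5, 6)]]   # spread out, overlapping
print(round(dunn_index(tight), 3))  # → 6.403
print(dunn_index(tight) > dunn_index(loose))  # → True
```

Choosing the cluster count with the highest DI, as done in the second step, rewards exactly this compact-and-separated structure.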
ACKNOWLEDGEMENTS
The authors wish to acknowledge the technical and infrastructural help rendered by the faculty members of the CSE department of CHRIST (Deemed to be University), Bangalore, India. The authors would also like to thank the medical practitioners across Karnataka, India, for providing a detailed discussion and report on COVID-19, its possible transmission from one living being to another, and the necessary precautions to be taken.
REFERENCES
[1] “Coronavirus disease (COVID-19) pandemic,” World Health Organization. https://www.who.int (accessed Mar. 2020).
[2] “Coronavirus (COVID-19),” Centers for Disease Control and Prevention. https://www.cdc.gov/ (accessed Jul. 2020).
[3] “Coronavirus (COVID-19),” Ministry of Health and Family Welfare. https://www.mohfw.gov.in/ (accessed Jul. 2020).
[4] “Coronavirus (COVID-19),” Government Online Services. https://www.mygov.in/ (accessed Mar. 2020).
[5] S. A. Kokatnoor and B. Krishnan, “A two-stepped feature engineering process for topic modeling using batchwise LDA with
stochastic variational inference model,” International Journal of Intelligent Engineering and Systems, vol. 13, no. 4, pp. 333–345,
Aug. 2020, doi: 10.22266/IJIES2020.0831.29.
[6] D. Alvarez-Melis and M. Saveski, “Topic modeling in Twitter: aggregating tweets by conversations,” in ICWSM, 2016, vol. 20,
no. 1.
[7] S. Kodati, R. Vivekanandam, and G. Ravi, “Comparative analysis of clustering algorithms with heart disease datasets using data
mining weka tool,” in Advances in Intelligent Systems and Computing, vol. 900, Springer Singapore, 2019, pp. 111–117.
[8] M. Z. Rodriguez et al., “Clustering algorithms: a comparative approach,” PLoS ONE, vol. 14, no. 1, Jan. 2019, Art. no. 0210236, doi: 10.1371/journal.pone.0210236.
[9] J. Khalfallah and J. B. H. Slama, “A comparative study of the various clustering algorithms in e-learning systems using weka
tools,” in Proceedings of 2018 JCCO Joint International Conference on ICT in Education and Training, International Conference
on Computing in Arabic, and International Conference on Geocomputing, JCCO: TICET-ICCA-GECO 2018, Nov. 2018,
pp. 76–82, doi: 10.1109/ICCA-TICET.2018.8726188.
[10] S. A. Kokatnoor and B. Krishnan, “Self-supervised learning based anomaly detection in online social media,” International
Journal of Intelligent Engineering and Systems, vol. 13, no. 3, pp. 446–456, Jun. 2020, doi: 10.22266/IJIES2020.0630.40.
[11] F. Ros and S. Guillaume, “A hierarchical clustering algorithm and an improvement of the single linkage criterion to deal with
noise,” Expert Systems with Applications, vol. 128, pp. 96–108, Aug. 2019, doi: 10.1016/j.eswa.2019.03.031.
[12] A. Sharma, R. K. Gupta, and A. Tiwari, “Improved density based spatial clustering of applications of noise clustering algorithm
for knowledge discovery in spatial data,” Mathematical Problems in Engineering, vol. 2016, pp. 1–9, 2016,
doi: 10.1155/2016/1564516.
[13] B. Wu and B. M. Wilamowski, “A fast density and grid based clustering method for data with arbitrary shapes and noise,” IEEE
Transactions on Industrial Informatics, vol. 13, no. 4, pp. 1620–1628, Aug. 2017, doi: 10.1109/TII.2016.2628747.
[14] W. Anwar, I. S. Bajwa, M. A. Choudhary, and S. Ramzan, “An empirical study on forensic analysis of Urdu text using LDA-
based authorship attribution,” IEEE Access, vol. 7, pp. 3224–3234, 2019, doi: 10.1109/ACCESS.2018.2885011.
[15] R. Sujath, J. M. Chatterjee, and A. E. Hassanien, “A machine learning forecasting model for COVID-19 pandemic in India,”
Stochastic Environmental Research and Risk Assessment, vol. 34, no. 7, pp. 959–972, May 2020,
doi: 10.1007/s00477-020-01827-8.
[16] P. Radhakrishnan and B. Vignesh, “A note on rank correlation and semi-supervised machine learning based measure,” in 2017
Innovations in Power and Advanced Computing Technologies, i-PACT 2017, Apr. 2017, vol. 2017-January, pp. 1–8,
doi: 10.1109/IPACT.2017.8245035.
[17] V. Chamola, V. Hassija, V. Gupta, and M. Guizani, “A comprehensive review of the COVID-19 pandemic and the role of IoT,
drones, AI, blockchain, and 5G in managing its impact,” IEEE Access, vol. 8, pp. 90225–90265, 2020,
doi: 10.1109/ACCESS.2020.2992341.
[18] R. F. Sear et al., “Quantifying COVID-19 content in the online health opinion war using machine learning,” IEEE Access, vol. 8,
pp. 91886–91893, 2020, doi: 10.1109/ACCESS.2020.2993967.
[19] L. Li et al., “Characterizing the propagation of situational information in social media during COVID-19 epidemic: a case study
on Weibo,” IEEE Transactions on Computational Social Systems, vol. 7, no. 2, pp. 556–562, Apr. 2020,
doi: 10.1109/TCSS.2020.2980007.
[20] N. James and M. Menzies, “Cluster-based dual evolution for multivariate time series: analyzing COVID-19,” Chaos, vol. 30,
no. 6, Jun. 2020, Art. no. 61108, doi: 10.1063/5.0013156.
[21] Y. Xu, X. Fu, H. Li, G. Dong, and Q. Wang, “A k-means algorithm based on feature weighting,” MATEC Web of Conferences, vol. 232, 2018, Art. no. 3005, doi: 10.1051/matecconf/201823203005.
[22] W. Yang, H. Long, L. Ma, and H. Sun, “Research on clustering method based on weighted distance density and k-means,”
Procedia Computer Science, vol. 166, pp. 507–511, 2020, doi: 10.1016/j.procs.2020.02.056.
[23] M. K. Siddiqui et al., “Correlation between temperature and COVID-19 (suspected, confirmed and death) cases based on machine
learning analysis,” Journal of Pure and Applied Microbiology, vol. 14, no. suppl 1, pp. 1017–1024, Apr. 2020,
doi: 10.22207/JPAM.14.SPL1.40.
[24] L. Zou and W. W. Song, “LDA-TM: a two-step approach to Twitter topic data clustering,” in Proceedings of 2016 IEEE
International Conference on Cloud Computing and Big Data Analysis, ICCCBDA 2016, Jul. 2016, pp. 342–347,
doi: 10.1109/ICCCBDA.2016.7529581.
[25] A. Abd-Alrazaq, D. Alhuwail, M. Househ, M. Hai, and Z. Shah, “Top concerns of tweeters during the COVID-19 pandemic: a
surveillance study,” Journal of Medical Internet Research, vol. 22, no. 4, Apr. 2020, doi: 10.2196/19016.
[26] D. C. Stokes, A. Andy, S. C. Guntuku, L. H. Ungar, and R. M. Merchant, “Public priorities and concerns regarding COVID-19 in
an online discussion forum: longitudinal topic modeling,” Journal of General Internal Medicine, vol. 35, no. 7, pp. 2244–2247,
May 2020, doi: 10.1007/s11606-020-05889-w.
[27] B. X. Tran et al., “Studies of novel coronavirus disease 19 (COVID-19) pandemic: a global analysis of literature,” International
Journal of Environmental Research and Public Health, vol. 17, no. 11, pp. 1–20, Jun. 2020, doi: 10.3390/ijerph17114095.
[28] “Data Never Sleeps 7.0,” Domo, 2020. https://www.domo.com/learn/data-never-sleeps-7 (accessed Dec. 2020).
[29] R. N. Waykole and A. D. Thakare, “A review of feature extraction methods for text classification,” International Journal of Advance Engineering and Research Development, vol. 5, no. 4, pp. 351–354, 2018.
[30] F. P. Shah and V. Patel, “A review on feature selection and feature extraction for text classification,” in Proceedings of the 2016
IEEE International Conference on Wireless Communications, Signal Processing and Networking, WiSPNET 2016, Mar. 2016,
pp. 2264–2268, doi: 10.1109/WiSPNET.2016.7566545.
[31] M. Jiang, C. Zhao, Z. Mo, and J. Wen, “An improved algorithm based on bloom filter and its application in bar code recognition
and processing,” Eurasip Journal on Image and Video Processing, vol. 2018, no. 1, Dec. 2018, doi: 10.1186/s13640-018-0375-6.
[32] M. Misuraca, M. Spano, and S. Balbi, “BMS: an improved dunn index for document clustering validation,” Communications in
Statistics - Theory and Methods, vol. 48, no. 20, pp. 5036–5049, Oct. 2019, doi: 10.1080/03610926.2018.1504968.
[33] S. Seshathriaathithyan, M. V. Sriram, S. Prasanna, and R. Venkatesan, “Affective - hierarchical classification of text - An
approach using NLP toolkit,” in Proceedings of IEEE International Conference on Circuit, Power and Computing Technologies,
ICCPCT 2016, Mar. 2016, pp. 1–6, doi: 10.1109/ICCPCT.2016.7530228.
[34] R. Adhitama, R. Kusumaningrum, and R. Gernowo, “Topic labeling towards news document collection based on latent dirichlet
allocation and ontology,” in Proceedings - 2017 1st International Conference on Informatics and Computational Sciences,
ICICoS 2017, Nov. 2017, pp. 247–251, doi: 10.1109/ICICOS.2017.8276370.
BIOGRAPHIES OF AUTHORS
Sujatha Arun Kokatnoor completed her Doctoral Thesis at CHRIST (Deemed to be University) in the machine learning area of anomaly detection in social media networks. She is currently working as an Assistant Professor in the Department of Computer Science and Engineering at CHRIST (Deemed to be University). Her areas of research include data science, machine learning, and deep learning. She can be contacted at email: sujatha.ak@christuniversity.in.
Balachandran Krishnan completed his Doctoral Thesis at Anna University in the data mining area of ensemble modeling. He is currently working as a Professor in the Department of Computer Science and Engineering at CHRIST (Deemed to be University). His areas of research include data science, image processing and parallel computing. He can be contacted at email: balachandran.k@christuniversity.in.