Statisticians in both academia and industry have encountered problems with high-dimensional data. The rapid feature increase has caused the feature count to outstrip the instance count. There are several established methods when selecting features from massive amounts of breast cancer data. Even so, overfitting continues to be a problem. The challenge of choosing important features with minimum loss in a different sample size is another area with room for development. As a result, the feature selection technique is crucial for dealing with high-dimensional data classification issues. This paper proposed a new architecture for high-dimensional breast cancer data using filtering techniques and a logistic regression model. Essential features are filtered out using a combination of hybrid chiโsquare and hybrid information gain (hybrid IG) with logistic regression as classifier. The results showed that hybrid IG performed the best for high-dimensional breast and prostate cancer data. The top 50 and 22 features outperformed the other configurations, with the highest classification accuracies of 86.96% and 82.61%, respectively, after integrating the hybrid information gain and logistic function (hybrid IG+LR) with a sample size of 75. In the future, multiclass classification of multidimensional medical data to be evaluated using data from a different domain.
Breast cancer is the leading cause of death for women worldwide. Cancer can be discovered early, lowering the rate of death. Machine learning techniques are a hot field of research, and they have been shown to be helpful in cancer prediction and early detection. The primary purpose of this research is to identify which machine learning algorithms are the most successful in predicting and diagnosing breast cancer, according to five criteria: specificity, sensitivity, precision, accuracy, and F1 score. The project is finished in the Anaconda environment, which uses Python's NumPy and SciPy numerical and scientific libraries as well as matplotlib and Pandas. In this study, the Wisconsin diagnostic breast cancer dataset was used to evaluate eleven machine learning classifiers: decision tree, quadratic discriminant analysis, AdaBoost, Bagging meta estimator, Extra randomized trees, Gaussian process classifier, Ridge, Gaussian nave Bayes, k-Nearest neighbors, multilayer perceptron, and support vector classifier. During performance analysis, extremely randomized trees outperformed all other classifiers with an F1-score of 96.77% after data collection and data analysis.
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
ย
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled this disease analysis. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier, easier, or more comprehensive than others. This research is focused on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise the testing algorithm. We used genetic programming methods to choose classification machines' best features and parameter values. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F. Tree processes, i.e. 97.71%.
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
ย
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
GRC-MS: A GENETIC RULE-BASED CLASSIFIER MODEL FOR ANALYSIS OF MASS SPECTRA DATAcscpconf
ย
Many studies uses different data mining techniques to analyze mass spectrometry data and
extract useful knowledge about biomarkers. These Biomarkers allow the medical experts to
determine whether an individual has a disease or not. Some of these studies have proposed
models that have obtained high accuracy. However, the black-box nature and complexity of the
proposed models have posed significant issues. Thus, to address this problem and build an
accurate model, we use a genetic algorithm for feature selection along with a rule-based
classifier, namely Genetic Rule-Based Classifier algorithm for Mass Spectra data (GRC-MS).
According to the literature, rule-based classifiers provide understandable rules, but not
accurate. In addition, genetic algorithms have achieved excellent results when used with
different classifiers for feature selection. Experiments are conducted on real dataset and the
proposed classifier GRC-MS achieves 99.7% accuracy. In addition, the generated rules are
more understandable than those of other classifier models
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
ย
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second
most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast
cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast
cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data
processing tools, we tackled this disease analysis. Data mining is an important step of library discovery
where intelligent methods are used to detect patterns. Several clinical breast cancer studies were
conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier,
easier, or more comprehensive than others. This research is focused on genetic programming and machine
learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise
the testing algorithm. We used genetic programming methods to choose classification machines' best
features and parameter values. Data mining is an important step of library discovery where intelligent
methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data
set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A
comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F.
Tree processes, i.e. 97.71%.
Breast cancer is the leading cause of death for women worldwide. Cancer can be discovered early, lowering the rate of death. Machine learning techniques are a hot field of research, and they have been shown to be helpful in cancer prediction and early detection. The primary purpose of this research is to identify which machine learning algorithms are the most successful in predicting and diagnosing breast cancer, according to five criteria: specificity, sensitivity, precision, accuracy, and F1 score. The project is finished in the Anaconda environment, which uses Python's NumPy and SciPy numerical and scientific libraries as well as matplotlib and Pandas. In this study, the Wisconsin diagnostic breast cancer dataset was used to evaluate eleven machine learning classifiers: decision tree, quadratic discriminant analysis, AdaBoost, Bagging meta estimator, Extra randomized trees, Gaussian process classifier, Ridge, Gaussian nave Bayes, k-Nearest neighbors, multilayer perceptron, and support vector classifier. During performance analysis, extremely randomized trees outperformed all other classifiers with an F1-score of 96.77% after data collection and data analysis.
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
ย
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled this disease analysis. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier, easier, or more comprehensive than others. This research is focused on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise the testing algorithm. We used genetic programming methods to choose classification machines' best features and parameter values. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F. Tree processes, i.e. 97.71%.
Controlling informative features for improved accuracy and faster predictions...Damian R. Mingle, MBA
ย
Identification of suitable biomarkers for accurate prediction of phenotypic outcomes is a goal for personalized medicine. However, current machine learning approaches are either too complex or perform poorly.
For more information:
http://societyofdatascientists.com/controlling-informative-features-for-improved-accuracy-and-faster-predictions-in-omentum-cancer-models/?src=slideshare
GRC-MS: A GENETIC RULE-BASED CLASSIFIER MODEL FOR ANALYSIS OF MASS SPECTRA DATAcscpconf
ย
Many studies uses different data mining techniques to analyze mass spectrometry data and
extract useful knowledge about biomarkers. These Biomarkers allow the medical experts to
determine whether an individual has a disease or not. Some of these studies have proposed
models that have obtained high accuracy. However, the black-box nature and complexity of the
proposed models have posed significant issues. Thus, to address this problem and build an
accurate model, we use a genetic algorithm for feature selection along with a rule-based
classifier, namely Genetic Rule-Based Classifier algorithm for Mass Spectra data (GRC-MS).
According to the literature, rule-based classifiers provide understandable rules, but not
accurate. In addition, genetic algorithms have achieved excellent results when used with
different classifiers for feature selection. Experiments are conducted on real dataset and the
proposed classifier GRC-MS achieves 99.7% accuracy. In addition, the generated rules are
more understandable than those of other classifier models
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
ย
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second
most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast
cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast
cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data
processing tools, we tackled this disease analysis. Data mining is an important step of library discovery
where intelligent methods are used to detect patterns. Several clinical breast cancer studies were
conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier,
easier, or more comprehensive than others. This research is focused on genetic programming and machine
learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise
the testing algorithm. We used genetic programming methods to choose classification machines' best
features and parameter values. Data mining is an important step of library discovery where intelligent
methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data
set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A
comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F.
Tree processes, i.e. 97.71%.
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
ย
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast
cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled this disease analysis. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier, easier, or more comprehensive than others. This research is focused on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise
the testing algorithm. We used genetic programming methods to choose classification machines' best features and parameter values. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data
set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F. Tree processes, i.e. 97.71%
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
ย
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled this disease analysis. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier, easier, or more comprehensive than others. This research is focused on genetic programming and machine
learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise the testing algorithm. We used genetic programming methods to choose classification machines' best features and parameter values. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data
set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F. Tree processes, i.e. 97.71%.
Performance evaluation of random forest with feature selection methods in pre...IJECEIAES
ย
Data mining is nothing but the process of viewing data in different angle and compiling it into appropriate information. Recent improvements in the area of data mining and machine learning have empowered the research in biomedical field to improve the condition of general health care. Since the wrong classification may lead to poor prediction, there is a need to perform the better classification which further improves the prediction rate of the medical datasets. When medical data mining is applied on the medical datasets the important and difficult challenges are the classification and prediction. In this proposed work we evaluate the PIMA Indian Diabtes data set of UCI repository using machine learning algorithm like Random Forest along with feature selection methods such as forward selection and backward elimination based on entropy evaluation method using percentage split as test option. The experiment was conducted using R studio platform and we achieved classification accuracy of 84.1%. From results we can say that Random Forest predicts diabetes better than other techniques with less number of attributes so that one can avoid least important test for identifying diabetes.
An efficient feature selection algorithm for health care data analysisjournalBEEI
ย
Diabete is a silent killer, which will slowly kill the person if it goes undetected. The existing system which uses F-score method and K-means clustering of checking whether a person has diabetes or not are 100% accurate, and anything which isn't a 100% is not acceptable in the medical field, as it could cost the lives of many people. Our proposed system aims at using some of the best features of the existing algorithms to predict diabetes, and combine these and based on these features; This research work turns them into a novel algorithm, which will be 100% accurate in its prediction. With the surge in technological advancements, we can use data mining to predict when a person would be diagnosed with diabetes. Specifically, we analyze the best features of chi-square algorithm and advanced clustering algorithm (ACA). This research work is done using the Pima Indian Diabetes dataset provided by National Institutes of Diabetes and Digestive and Kidney Diseases. Using classification theorems and methods we can consider different factors like age, BMI, blood pressure and the importance given to these attributes overall, and singles these attributes out, and use them for the prediction of diabetes.
A new model for large dataset dimensionality reduction based on teaching lear...TELKOMNIKA JOURNAL
ย
One of the human diseases with a high rate of mortality each year is breast cancer (BC). Among all the forms of cancer, BC is the commonest cause of death among women globally. Some of the effective ways of data classification are data mining and classification methods. These methods are particularly efficient in the medical field due to the presence of irrelevant and redundant attributes in medical datasets. Such redundant attributes are not needed to obtain an accurate estimation of disease diagnosis. Teaching learning-based optimization (TLBO) is a new metaheuristic that has been successfully applied to several intractable optimization problems in recent years. This paper presents the use of a multi-objective TLBO algorithm for the selection of feature subsets in automatic BC diagnosis. For the classification task in this work, the logistic regression (LR) method was deployed. From the results, the projected method produced better BC dataset classification accuracy (classified into malignant and benign). This result showed that the projected TLBO is an efficient features optimization technique for sustaining data-based decision-making systems.
Multivariate sample similarity measure for feature selection with a resemblan...IJECEIAES
ย
Feature selection improves the classification performance of machine learning models. It also identifies the important features and eliminates those with little significance. Furthermore, feature selection reduces the dimensionality of training and testing data points. This study proposes a feature selection method that uses a multivariate sample similarity measure. The method selects features with significant contributions using a machine-learning model. The multivariate sample similarity measure is evaluated using the University of California, Irvine heart disease dataset and compared with existing feature selection methods. The multivariate sample similarity measure is evaluated with metrics such as minimum subset selected, accuracy, F1-score, and area under the curve (AUC). The results show that the proposed method is able to diagnose chest pain, thallium scan, and major vessels scanned using X-rays with a high capability to distinguish between healthy and heart disease patients with a 99.6% accuracy.
Exploring the performance of feature selection method using breast cancer dat...nooriasukmaningtyas
ย
Breast cancer is the most common type of cancer occurring mostly in females. In recent years, many researchers have devoted to automate diagnosis of breast cancer by developing different machine learning model. However, the quality and quantity of feature in breast cancer diagnostic dataset have significant effect on the accuracy and efficiency of predictive model. Feature selection is effective method for reducing the dimensionality and improving the accuracy of predictive model. The use of feature selection is to determine feature required for training model and to remove irrelevant and duplicate feature. Duplicate feature is a feature that is highly correlated to another feature. The objective of this study is to conduct experimental research on three different feature selection methods for breast cancer prediction. Sequential, embedded and chi-square feature selection are implemented using breast cancer diagnostic dataset. The study compares the performance of sequential embedded and chi-square feature selection on test set. The experimental result evidently shows that sequential feature selection outperforms as compared to chi-square (X2) statistics and embedded feature selection. Overall, sequential feature selection achieves better accuracy of 98.3% as compared to chi-square (X2) statistics and embedded feature selection.
Classification AlgorithmBased Analysis of Breast Cancer DataIIRindia
ย
The classification algorithms are very frequently used algorithms for analyzing various kinds of data available in different repositories which have real world applications. The main objective of this research work is to find the performance of classification algorithms in analyzing Breast Cancer data via analyzing the mammogram images based its characteristics.Different attribute values of cancer affected mammogram images are considered for analysis in this work. The Patients food habits, age of the patients, their life styles, occupation, their problem about the diseases and other information are taken into account for classification. Finally, performance of classification algorithms J48, CART and ADTree are given with its accuracy. The accuracy of taken algorithms is measured by various measures like specificity, sensitivity and kappa statistics (Errors).
Enhancing feature selection with a novel hybrid approach incorporating geneti...IJECEIAES
ย
Computing advances in data storage are leading to rapid growth in large-scale datasets. Using all features increases temporal/spatial complexity and negatively influences performance. Feature selection is a fundamental stage in data preprocessing, removing redundant and irrelevant features to minimize the number of features and enhance the performance of classification accuracy. Numerous optimization algorithms were employed to handle feature selection (FS) problems, and they outperform conventional FS techniques. However, there is no metaheuristic FS method that outperforms other optimization algorithms in many datasets. This motivated our study to incorporate the advantages of various optimization techniques to obtain a powerful technique that outperforms other methods in many datasets from different domains. In this article, a novel combined method GASI is developed using swarm intelligence (SI) based feature selection techniques and genetic algorithms (GA) that uses a multi-objective fitness function to seek the optimal subset of features. To assess the performance of the proposed approach, seven datasets have been collected from the UCI repository and exploited to test the newly established feature selection technique. The experimental results demonstrate that the suggested method GASI outperforms many powerful SI-based feature selection techniques studied. GASI obtains a better average fitness value and improves classification performance.
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...rahulmonikasharma
ย
Classification problems in high dimensional information with little sort of observations became furthercommon significantly in microarray information. The increasing amount of text data on internet sites affects the agglomerationanalysis. The text agglomeration could also be a positive analysis technique used for partitioning a huge amount of datainto clusters. Hence, the most necessary draw back that affects the text agglomeration technique is that the presenceuninformative and distributed choices in text documents. A broad class of boosting algorithms is known as actingcoordinate-wise gradient descent to attenuate some potential performs of the margins of a data set. This paperproposes a novel analysis live Q-statistic that comes with the soundness of the chosen feature set to boot to theprediction accuracy. Then we've a bent to propose the Booster of associate degree FS algorithm that enhances theworth of the Q-statistic of the algorithm applied.
A Comparative Analysis of Hybridized Genetic Algorithm in Predictive Models o...Shakas Technologies
ย
A Comparative Analysis of Hybridized Genetic Algorithm in Predictive Models of Breast Cancer Tumors.
Shakas Technologies ( Galaxy of Knowledge)
#11/A 2nd East Main Road,
Gandhi Nagar,
Vellore - 632006.
Mobile : +91-9500218218 / 8220150373| land line- 0416- 3552723
Shakas Training & Development | Shakas Sales & Services | Shakas Educational Trust|IEEE projects | Research & Development | Journal Publication |
Email : info@shakastech.com | shakastech@gmail.com |
website: www.shakastech.comโ
Facebook: https://www.facebook.com/pages/Shakas-Technologies
Evolving Efficient Clustering and Classification Patterns in Lymphography Dat...ijsc
ย
Data mining refers to the process of retrieving knowledge by discovering novel and relative patterns from large datasets. Clustering and Classification are two distinct phases in data mining that work to provide an established, proven structure from a voluminous collection of facts. A dominant area of modern-day research in the field of medical investigations includes disease prediction and malady categorization. In this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering techniques and compare the performance of classification algorithms on the clinical data. Feature selection is a supervised method that attempts to select a subset of the predictor features based on the information gain. The Lymphography dataset comprises of 18 predictor attributes and 148 instances with the class label having four distinct values. This paper highlights the accuracy of eight clustering algorithms in detecting clusters of patient records and predictor attributes and highlights the performance of sixteen classification algorithms on the Lymphography dataset that enables the classifier to accurately perform multi-class categorization of medical data. Our work asserts the fact that the Random Tree algorithm and the Quinlanโs C4.5 algorithm give 100 percent classification accuracy with all the predictor features and also with the feature subset selected by the Fisher Filtering feature selection algorithm.. It is also stated here that the Density Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm offers increased clustering accuracy in less computation time.
EVOLVING EFFICIENT CLUSTERING AND CLASSIFICATION PATTERNS IN LYMPHOGRAPHY DAT...ijsc
ย
Data mining refers to the process of retrieving knowledge by discovering novel and relative patterns from
large datasets. Clustering and Classification are two distinct phases in data mining that work to provide an
established, proven structure from a voluminous collection of facts. A dominant area of modern-day
research in the field of medical investigations includes disease prediction and malady categorization. In
this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering
techniques and compare the performance of classification algorithms on the clinical data. Feature
selection is a supervised method that attempts to select a subset of the predictor features based on the
information gain. The Lymphography dataset comprises of 18 predictor attributes and 148 instances with
the class label having four distinct values. This paper highlights the accuracy of eight clustering algorithms
in detecting clusters of patient records and predictor attributes and highlights the performance of sixteen
classification algorithms on the Lymphography dataset that enables the classifier to accurately perform
multi-class categorization of medical data. Our work asserts the fact that the Random Tree algorithm and
the Quinlanโs C4.5 algorithm give 100 percent classification accuracy with all the predictor features and
also with the feature subset selected by the Fisher Filtering feature selection algorithm.. It is also stated
here that the Density Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm
offers increased clustering accuracy in less computation time.
Predictive modeling for breast cancer based on machine learning algorithms an...IJECEIAES
ย
Breast cancer is one of the leading causes of death among women worldwide. However, early prediction of breast cancer plays a crucial role. Therefore, strong needs exist for automatic accurate early prediction of breast cancer. In this paper, machine learning (ML) classifiers combined with features selection methods are used to build an intelligent tool for breast cancer prediction. The Wisconsin diagnostic breast cancer (WDBC) dataset is used to train and test the model. Classification algorithms, including support vector machine (SVM), light gradient boosting machine (LightGBM), random forest (RF), logistic regression (LR), k-nearest neighbors (k-NN), and naรฏve Bayes, were employed. Performance measures for each of them were obtained, namely: accuracy, precision, recall, F-score, Kappa, Matthews correlation coefficient (MCC), and time. The results indicate that without feature selection, LightGBM achieves the highest accuracy at 95%. With minimum redundancy maximum relevance (mRMR) feature selection (15 features), LightGBM outperforms other classifiers, achieving an accuracy of 98%. For Pearson correlation coefficient feature selection (15 features), LightGBM also excels with a 95% accuracy rate. Lasso feature selection (5 features) produces varied results across classifiers, with logistic regression achieving the highest accuracy at 96%. These findings underscore the importance of feature selection in refining model performance and in improving detection for breast cancer.
Simplified Knowledge Prediction: Application of Machine Learning in Real LifePeea Bal Chakraborty
ย
Machine learning is the scientific study of algorithms and statistical models that is used by the machines to perform a specific task depending on patterns and inference rather than explicit instructions. This research and analysis aims to observe how precisely a machine can predict that a patient suspected of breast cancer is having malignant or benign cancer.In this paper the classification of cancer type and prediction of risk levels is done by various model of machine learning and is pictorially depicted by various tools of visual analytics.
Bibliometric analysis highlighting the role of women in addressing climate ch...IJECEIAES
ย
Fossil fuel consumption increased quickly, contributing to climate change
that is evident in unusual flooding and draughts, and global warming. Over
the past ten years, women's involvement in society has grown dramatically,
and they succeeded in playing a noticeable role in reducing climate change.
A bibliometric analysis of data from the last ten years has been carried out to
examine the role of women in addressing the climate change. The analysis's
findings discussed the relevant to the sustainable development goals (SDGs),
particularly SDG 7 and SDG 13. The results considered contributions made
by women in the various sectors while taking geographic dispersion into
account. The bibliometric analysis delves into topics including women's
leadership in environmental groups, their involvement in policymaking, their
contributions to sustainable development projects, and the influence of
gender diversity on attempts to mitigate climate change. This study's results
highlight how women have influenced policies and actions related to climate
change, point out areas of research deficiency and recommendations on how
to increase role of the women in addressing the climate change and
achieving sustainability. To achieve more successful results, this initiative
aims to highlight the significance of gender equality and encourage
inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter...IJECEIAES
ย
The active and reactive load changes have a significant impact on voltage
and frequency. In this paper, in order to stabilize the microgrid (MG) against
load variations in islanding mode, the active and reactive power of all
distributed generators (DGs), including energy storage (battery), diesel
generator, and micro-turbine, are controlled. The micro-turbine generator is
connected to MG through a three-phase to three-phase matrix converter, and
the droop control method is applied for controlling the voltage and
frequency of MG. In addition, a method is introduced for voltage and
frequency control of micro-turbines in the transition state from gridconnected mode to islanding mode. A novel switching strategy of the matrix
converter is used for converting the high-frequency output voltage of the
micro-turbine to the grid-side frequency of the utility system. Moreover,
using the switching strategy, the low-order harmonics in the output current
and voltage are not produced, and consequently, the size of the output filter
would be reduced. In fact, the suggested control strategy is load-independent
and has no frequency conversion restrictions. The proposed approach for
voltage and frequency regulation demonstrates exceptional performance and
favorable response across various load alteration scenarios. The suggested
strategy is examined in several scenarios in the MG test systems, and the
simulation results are addressed.
More Related Content
Similar to Hybrid filtering methods for feature selection in high-dimensional cancer data
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
ย
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast
cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled this disease analysis. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier, easier, or more comprehensive than others. This research is focused on genetic programming and machine learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise
the testing algorithm. We used genetic programming methods to choose classification machines' best features and parameter values. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data
set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F. Tree processes, i.e. 97.71%
SVM &GA-CLUSTERING BASED FEATURE SELECTION APPROACH FOR BREAST CANCER DETECTIONijscai
ย
Mortality leading among women in developed countries is breast cancer. Breast cancer is women's second most prominent cause of cancer mortality worldwide. In recent decades, women's high prevalence of breast cancer has risen dramatically. This paper discussed several data analysis methods used to detect breast cancer early. Breast cancer diagnosis distinguishes benign and malignant breast lumps. Using data processing tools, we tackled this disease analysis. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. Several clinical breast cancer studies were conducted using soft computing and machine learning techniques. Sometimes their algorithms are easier, easier, or more comprehensive than others. This research is focused on genetic programming and machine
learning algorithms to reliably identify benign and malignant breast cancer. This study aimed to optimise the testing algorithm. We used genetic programming methods to choose classification machines' best features and parameter values. Data mining is an important step of library discovery where intelligent methods are used to detect patterns. We are analysing data accessible from the U.C.I. deep-learning data
set in Wisconsin. In this experiment, we equate four Weka clustering strategies with genetic clustering. A comparison of results reveals that sequential minimal optimization (S.M.O.) is better than I.B.K. and B.F. Tree processes, i.e. 97.71%.
Performance evaluation of random forest with feature selection methods in pre...IJECEIAES
ย
Data mining is nothing but the process of viewing data in different angle and compiling it into appropriate information. Recent improvements in the area of data mining and machine learning have empowered the research in biomedical field to improve the condition of general health care. Since the wrong classification may lead to poor prediction, there is a need to perform the better classification which further improves the prediction rate of the medical datasets. When medical data mining is applied on the medical datasets the important and difficult challenges are the classification and prediction. In this proposed work we evaluate the PIMA Indian Diabtes data set of UCI repository using machine learning algorithm like Random Forest along with feature selection methods such as forward selection and backward elimination based on entropy evaluation method using percentage split as test option. The experiment was conducted using R studio platform and we achieved classification accuracy of 84.1%. From results we can say that Random Forest predicts diabetes better than other techniques with less number of attributes so that one can avoid least important test for identifying diabetes.
An efficient feature selection algorithm for health care data analysisjournalBEEI
ย
Diabete is a silent killer, which will slowly kill the person if it goes undetected. The existing system which uses F-score method and K-means clustering of checking whether a person has diabetes or not are 100% accurate, and anything which isn't a 100% is not acceptable in the medical field, as it could cost the lives of many people. Our proposed system aims at using some of the best features of the existing algorithms to predict diabetes, and combine these and based on these features; This research work turns them into a novel algorithm, which will be 100% accurate in its prediction. With the surge in technological advancements, we can use data mining to predict when a person would be diagnosed with diabetes. Specifically, we analyze the best features of chi-square algorithm and advanced clustering algorithm (ACA). This research work is done using the Pima Indian Diabetes dataset provided by National Institutes of Diabetes and Digestive and Kidney Diseases. Using classification theorems and methods we can consider different factors like age, BMI, blood pressure and the importance given to these attributes overall, and singles these attributes out, and use them for the prediction of diabetes.
A new model for large dataset dimensionality reduction based on teaching lear...TELKOMNIKA JOURNAL
ย
One of the human diseases with a high rate of mortality each year is breast cancer (BC). Among all the forms of cancer, BC is the commonest cause of death among women globally. Some of the effective ways of data classification are data mining and classification methods. These methods are particularly efficient in the medical field due to the presence of irrelevant and redundant attributes in medical datasets. Such redundant attributes are not needed to obtain an accurate estimation of disease diagnosis. Teaching learning-based optimization (TLBO) is a new metaheuristic that has been successfully applied to several intractable optimization problems in recent years. This paper presents the use of a multi-objective TLBO algorithm for the selection of feature subsets in automatic BC diagnosis. For the classification task in this work, the logistic regression (LR) method was deployed. From the results, the projected method produced better BC dataset classification accuracy (classified into malignant and benign). This result showed that the projected TLBO is an efficient features optimization technique for sustaining data-based decision-making systems.
Multivariate sample similarity measure for feature selection with a resemblan...IJECEIAES
ย
Feature selection improves the classification performance of machine learning models. It also identifies the important features and eliminates those with little significance. Furthermore, feature selection reduces the dimensionality of training and testing data points. This study proposes a feature selection method that uses a multivariate sample similarity measure. The method selects features with significant contributions using a machine-learning model. The multivariate sample similarity measure is evaluated using the University of California, Irvine heart disease dataset and compared with existing feature selection methods. The multivariate sample similarity measure is evaluated with metrics such as minimum subset selected, accuracy, F1-score, and area under the curve (AUC). The results show that the proposed method is able to diagnose chest pain, thallium scan, and major vessels scanned using X-rays with a high capability to distinguish between healthy and heart disease patients with a 99.6% accuracy.
Exploring the performance of feature selection method using breast cancer dat...nooriasukmaningtyas
ย
Breast cancer is the most common type of cancer occurring mostly in females. In recent years, many researchers have devoted to automate diagnosis of breast cancer by developing different machine learning model. However, the quality and quantity of feature in breast cancer diagnostic dataset have significant effect on the accuracy and efficiency of predictive model. Feature selection is effective method for reducing the dimensionality and improving the accuracy of predictive model. The use of feature selection is to determine feature required for training model and to remove irrelevant and duplicate feature. Duplicate feature is a feature that is highly correlated to another feature. The objective of this study is to conduct experimental research on three different feature selection methods for breast cancer prediction. Sequential, embedded and chi-square feature selection are implemented using breast cancer diagnostic dataset. The study compares the performance of sequential embedded and chi-square feature selection on test set. The experimental result evidently shows that sequential feature selection outperforms as compared to chi-square (X2) statistics and embedded feature selection. Overall, sequential feature selection achieves better accuracy of 98.3% as compared to chi-square (X2) statistics and embedded feature selection.
Classification AlgorithmBased Analysis of Breast Cancer DataIIRindia
ย
The classification algorithms are very frequently used algorithms for analyzing various kinds of data available in different repositories which have real world applications. The main objective of this research work is to find the performance of classification algorithms in analyzing Breast Cancer data via analyzing the mammogram images based its characteristics.Different attribute values of cancer affected mammogram images are considered for analysis in this work. The Patients food habits, age of the patients, their life styles, occupation, their problem about the diseases and other information are taken into account for classification. Finally, performance of classification algorithms J48, CART and ADTree are given with its accuracy. The accuracy of taken algorithms is measured by various measures like specificity, sensitivity and kappa statistics (Errors).
Enhancing feature selection with a novel hybrid approach incorporating geneti...IJECEIAES
ย
Computing advances in data storage are leading to rapid growth in large-scale datasets. Using all features increases temporal/spatial complexity and negatively influences performance. Feature selection is a fundamental stage in data preprocessing, removing redundant and irrelevant features to minimize the number of features and enhance the performance of classification accuracy. Numerous optimization algorithms were employed to handle feature selection (FS) problems, and they outperform conventional FS techniques. However, there is no metaheuristic FS method that outperforms other optimization algorithms in many datasets. This motivated our study to incorporate the advantages of various optimization techniques to obtain a powerful technique that outperforms other methods in many datasets from different domains. In this article, a novel combined method GASI is developed using swarm intelligence (SI) based feature selection techniques and genetic algorithms (GA) that uses a multi-objective fitness function to seek the optimal subset of features. To assess the performance of the proposed approach, seven datasets have been collected from the UCI repository and exploited to test the newly established feature selection technique. The experimental results demonstrate that the suggested method GASI outperforms many powerful SI-based feature selection techniques studied. GASI obtains a better average fitness value and improves classification performance.
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...rahulmonikasharma
ย
Classification problems in high dimensional information with little sort of observations became furthercommon significantly in microarray information. The increasing amount of text data on internet sites affects the agglomerationanalysis. The text agglomeration could also be a positive analysis technique used for partitioning a huge amount of datainto clusters. Hence, the most necessary draw back that affects the text agglomeration technique is that the presenceuninformative and distributed choices in text documents. A broad class of boosting algorithms is known as actingcoordinate-wise gradient descent to attenuate some potential performs of the margins of a data set. This paperproposes a novel analysis live Q-statistic that comes with the soundness of the chosen feature set to boot to theprediction accuracy. Then we've a bent to propose the Booster of associate degree FS algorithm that enhances theworth of the Q-statistic of the algorithm applied.
A Comparative Analysis of Hybridized Genetic Algorithm in Predictive Models o...Shakas Technologies
ย
A Comparative Analysis of Hybridized Genetic Algorithm in Predictive Models of Breast Cancer Tumors.
Shakas Technologies ( Galaxy of Knowledge)
#11/A 2nd East Main Road,
Gandhi Nagar,
Vellore - 632006.
Mobile : +91-9500218218 / 8220150373| land line- 0416- 3552723
Shakas Training & Development | Shakas Sales & Services | Shakas Educational Trust|IEEE projects | Research & Development | Journal Publication |
Email : info@shakastech.com | shakastech@gmail.com |
website: www.shakastech.comโ
Facebook: https://www.facebook.com/pages/Shakas-Technologies
Evolving Efficient Clustering and Classification Patterns in Lymphography Dat...ijsc
ย
Data mining refers to the process of retrieving knowledge by discovering novel and relative patterns from large datasets. Clustering and Classification are two distinct phases in data mining that work to provide an established, proven structure from a voluminous collection of facts. A dominant area of modern-day research in the field of medical investigations includes disease prediction and malady categorization. In this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering techniques and compare the performance of classification algorithms on the clinical data. Feature selection is a supervised method that attempts to select a subset of the predictor features based on the information gain. The Lymphography dataset comprises of 18 predictor attributes and 148 instances with the class label having four distinct values. This paper highlights the accuracy of eight clustering algorithms in detecting clusters of patient records and predictor attributes and highlights the performance of sixteen classification algorithms on the Lymphography dataset that enables the classifier to accurately perform multi-class categorization of medical data. Our work asserts the fact that the Random Tree algorithm and the Quinlanโs C4.5 algorithm give 100 percent classification accuracy with all the predictor features and also with the feature subset selected by the Fisher Filtering feature selection algorithm.. It is also stated here that the Density Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm offers increased clustering accuracy in less computation time.
EVOLVING EFFICIENT CLUSTERING AND CLASSIFICATION PATTERNS IN LYMPHOGRAPHY DAT...ijsc
ย
Data mining refers to the process of retrieving knowledge by discovering novel and relative patterns from
large datasets. Clustering and Classification are two distinct phases in data mining that work to provide an
established, proven structure from a voluminous collection of facts. A dominant area of modern-day
research in the field of medical investigations includes disease prediction and malady categorization. In
this paper, our focus is to analyze clusters of patient records obtained via unsupervised clustering
techniques and compare the performance of classification algorithms on the clinical data. Feature
selection is a supervised method that attempts to select a subset of the predictor features based on the
information gain. The Lymphography dataset comprises of 18 predictor attributes and 148 instances with
the class label having four distinct values. This paper highlights the accuracy of eight clustering algorithms
in detecting clusters of patient records and predictor attributes and highlights the performance of sixteen
classification algorithms on the Lymphography dataset that enables the classifier to accurately perform
multi-class categorization of medical data. Our work asserts the fact that the Random Tree algorithm and
the Quinlanโs C4.5 algorithm give 100 percent classification accuracy with all the predictor features and
also with the feature subset selected by the Fisher Filtering feature selection algorithm.. It is also stated
here that the Density Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm
offers increased clustering accuracy in less computation time.
Predictive modeling for breast cancer based on machine learning algorithms an...IJECEIAES
ย
Breast cancer is one of the leading causes of death among women worldwide. However, early prediction of breast cancer plays a crucial role. Therefore, strong needs exist for automatic accurate early prediction of breast cancer. In this paper, machine learning (ML) classifiers combined with features selection methods are used to build an intelligent tool for breast cancer prediction. The Wisconsin diagnostic breast cancer (WDBC) dataset is used to train and test the model. Classification algorithms, including support vector machine (SVM), light gradient boosting machine (LightGBM), random forest (RF), logistic regression (LR), k-nearest neighbors (k-NN), and naรฏve Bayes, were employed. Performance measures for each of them were obtained, namely: accuracy, precision, recall, F-score, Kappa, Matthews correlation coefficient (MCC), and time. The results indicate that without feature selection, LightGBM achieves the highest accuracy at 95%. With minimum redundancy maximum relevance (mRMR) feature selection (15 features), LightGBM outperforms other classifiers, achieving an accuracy of 98%. For Pearson correlation coefficient feature selection (15 features), LightGBM also excels with a 95% accuracy rate. Lasso feature selection (5 features) produces varied results across classifiers, with logistic regression achieving the highest accuracy at 96%. These findings underscore the importance of feature selection in refining model performance and in improving detection for breast cancer.
Simplified Knowledge Prediction: Application of Machine Learning in Real LifePeea Bal Chakraborty
ย
Machine learning is the scientific study of algorithms and statistical models that is used by the machines to perform a specific task depending on patterns and inference rather than explicit instructions. This research and analysis aims to observe how precisely a machine can predict that a patient suspected of breast cancer is having malignant or benign cancer.In this paper the classification of cancer type and prediction of risk levels is done by various model of machine learning and is pictorially depicted by various tools of visual analytics.
Similar to Hybrid filtering methods for feature selection in high-dimensional cancer data (20)
Bibliometric analysis highlighting the role of women in addressing climate ch...IJECEIAES
ย
Fossil fuel consumption increased quickly, contributing to climate change
that is evident in unusual flooding and draughts, and global warming. Over
the past ten years, women's involvement in society has grown dramatically,
and they succeeded in playing a noticeable role in reducing climate change.
A bibliometric analysis of data from the last ten years has been carried out to
examine the role of women in addressing the climate change. The analysis's
findings discussed the relevant to the sustainable development goals (SDGs),
particularly SDG 7 and SDG 13. The results considered contributions made
by women in the various sectors while taking geographic dispersion into
account. The bibliometric analysis delves into topics including women's
leadership in environmental groups, their involvement in policymaking, their
contributions to sustainable development projects, and the influence of
gender diversity on attempts to mitigate climate change. This study's results
highlight how women have influenced policies and actions related to climate
change, point out areas of research deficiency and recommendations on how
to increase role of the women in addressing the climate change and
achieving sustainability. To achieve more successful results, this initiative
aims to highlight the significance of gender equality and encourage
inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter...IJECEIAES
ย
The active and reactive load changes have a significant impact on voltage
and frequency. In this paper, in order to stabilize the microgrid (MG) against
load variations in islanding mode, the active and reactive power of all
distributed generators (DGs), including energy storage (battery), diesel
generator, and micro-turbine, are controlled. The micro-turbine generator is
connected to MG through a three-phase to three-phase matrix converter, and
the droop control method is applied for controlling the voltage and
frequency of MG. In addition, a method is introduced for voltage and
frequency control of micro-turbines in the transition state from gridconnected mode to islanding mode. A novel switching strategy of the matrix
converter is used for converting the high-frequency output voltage of the
micro-turbine to the grid-side frequency of the utility system. Moreover,
using the switching strategy, the low-order harmonics in the output current
and voltage are not produced, and consequently, the size of the output filter
would be reduced. In fact, the suggested control strategy is load-independent
and has no frequency conversion restrictions. The proposed approach for
voltage and frequency regulation demonstrates exceptional performance and
favorable response across various load alteration scenarios. The suggested
strategy is examined in several scenarios in the MG test systems, and the
simulation results are addressed.
Enhancing battery system identification: nonlinear autoregressive modeling fo...IJECEIAES
ย
Precisely characterizing Li-ion batteries is essential for optimizing their
performance, enhancing safety, and prolonging their lifespan across various
applications, such as electric vehicles and renewable energy systems. This
article introduces an innovative nonlinear methodology for system
identification of a Li-ion battery, employing a nonlinear autoregressive with
exogenous inputs (NARX) model. The proposed approach integrates the
benefits of nonlinear modeling with the adaptability of the NARX structure,
facilitating a more comprehensive representation of the intricate
electrochemical processes within the battery. Experimental data collected
from a Li-ion battery operating under diverse scenarios are employed to
validate the effectiveness of the proposed methodology. The identified
NARX model exhibits superior accuracy in predicting the battery's behavior
compared to traditional linear models. This study underscores the
importance of accounting for nonlinearities in battery modeling, providing
insights into the intricate relationships between state-of-charge, voltage, and
current under dynamic conditions.
Smart grid deployment: from a bibliometric analysis to a surveyIJECEIAES
ย
Smart grids are one of the last decades' innovations in electrical energy.
They bring relevant advantages compared to the traditional grid and
significant interest from the research community. Assessing the field's
evolution is essential to propose guidelines for facing new and future smart
grid challenges. In addition, knowing the main technologies involved in the
deployment of smart grids (SGs) is important to highlight possible
shortcomings that can be mitigated by developing new tools. This paper
contributes to the research trends mentioned above by focusing on two
objectives. First, a bibliometric analysis is presented to give an overview of
the current research level about smart grid deployment. Second, a survey of
the main technological approaches used for smart grid implementation and
their contributions are highlighted. To that effect, we searched the Web of
Science (WoS), and the Scopus databases. We obtained 5,663 documents
from WoS and 7,215 from Scopus on smart grid implementation or
deployment. With the extraction limitation in the Scopus database, 5,872 of
the 7,215 documents were extracted using a multi-step process. These two
datasets have been analyzed using a bibliometric tool called bibliometrix.
The main outputs are presented with some recommendations for future
research.
Use of analytical hierarchy process for selecting and prioritizing islanding ...IJECEIAES
ย
One of the problems that are associated to power systems is islanding
condition, which must be rapidly and properly detected to prevent any
negative consequences on the system's protection, stability, and security.
This paper offers a thorough overview of several islanding detection
strategies, which are divided into two categories: classic approaches,
including local and remote approaches, and modern techniques, including
techniques based on signal processing and computational intelligence.
Additionally, each approach is compared and assessed based on several
factors, including implementation costs, non-detected zones, declining
power quality, and response times using the analytical hierarchy process
(AHP). The multi-criteria decision-making analysis shows that the overall
weight of passive methods (24.7%), active methods (7.8%), hybrid methods
(5.6%), remote methods (14.5%), signal processing-based methods (26.6%),
and computational intelligent-based methods (20.8%) based on the
comparison of all criteria together. Thus, it can be seen from the total weight
that hybrid approaches are the least suitable to be chosen, while signal
processing-based methods are the most appropriate islanding detection
method to be selected and implemented in power system with respect to the
aforementioned factors. Using Expert Choice software, the proposed
hierarchy model is studied and examined.
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...IJECEIAES
ย
The power generated by photovoltaic (PV) systems is influenced by
environmental factors. This variability hampers the control and utilization of
solar cells' peak output. In this study, a single-stage grid-connected PV
system is designed to enhance power quality. Our approach employs fuzzy
logic in the direct power control (DPC) of a three-phase voltage source
inverter (VSI), enabling seamless integration of the PV connected to the
grid. Additionally, a fuzzy logic-based maximum power point tracking
(MPPT) controller is adopted, which outperforms traditional methods like
incremental conductance (INC) in enhancing solar cell efficiency and
minimizing the response time. Moreover, the inverter's real-time active and
reactive power is directly managed to achieve a unity power factor (UPF).
The system's performance is assessed through MATLAB/Simulink
implementation, showing marked improvement over conventional methods,
particularly in steady-state and varying weather conditions. For solar
irradiances of 500 and 1,000 W/m2
, the results show that the proposed
method reduces the total harmonic distortion (THD) of the injected current
to the grid by approximately 46% and 38% compared to conventional
methods, respectively. Furthermore, we compare the simulation results with
IEEE standards to evaluate the system's grid compatibility.
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...IJECEIAES
ย
Photovoltaic systems have emerged as a promising energy resource that
caters to the future needs of society, owing to their renewable, inexhaustible,
and cost-free nature. The power output of these systems relies on solar cell
radiation and temperature. In order to mitigate the dependence on
atmospheric conditions and enhance power tracking, a conventional
approach has been improved by integrating various methods. To optimize
the generation of electricity from solar systems, the maximum power point
tracking (MPPT) technique is employed. To overcome limitations such as
steady-state voltage oscillations and improve transient response, two
traditional MPPT methods, namely fuzzy logic controller (FLC) and perturb
and observe (P&O), have been modified. This research paper aims to
simulate and validate the step size of the proposed modified P&O and FLC
techniques within the MPPT algorithm using MATLAB/Simulink for
efficient power tracking in photovoltaic systems.
Adaptive synchronous sliding control for a robot manipulator based on neural ...IJECEIAES
ย
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for
robot hands is always an attractive topic in the research community. This is a
challenging problem because robot manipulators are complex nonlinear systems
and are often subject to fluctuations in loads and external disturbances. This
article proposes an adaptive synchronous sliding control scheme to improve trajectory tracking performance for a robot manipulator. The proposed controller
ensures that the positions of the joints track the desired trajectory, synchronize
the errors, and significantly reduces chattering. First, the synchronous tracking
errors and synchronous sliding surfaces are presented. Second, the synchronous
tracking error dynamics are determined. Third, a robust adaptive control law is
designed,the unknown components of the model are estimated online by the neural network, and the parameters of the switching elements are selected by fuzzy
logic. The built algorithm ensures that the tracking and approximation errors
are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results.
Simulation and experimental results show that the proposed controller is effective with small synchronous tracking errors, and the chattering phenomenon is
significantly reduced.
Remote field-programmable gate array laboratory for signal acquisition and de...IJECEIAES
ย
A remote laboratory utilizing field-programmable gate array (FPGA) technologies enhances studentsโ learning experience anywhere and anytime in embedded system design. Existing remote laboratories prioritize hardware access and visual feedback for observing board behavior after programming, neglecting comprehensive debugging tools to resolve errors that require internal signal acquisition. This paper proposes a novel remote embeddedsystem design approach targeting FPGA technologies that are fully interactive via a web-based platform. Our solution provides FPGA board access and debugging capabilities beyond the visual feedback provided by existing remote laboratories. We implemented a lab module that allows users to seamlessly incorporate into their FPGA design. The module minimizes hardware resource utilization while enabling the acquisition of a large number of data samples from the signal during the experiments by adaptively compressing the signal prior to data transmission. The results demonstrate an average compression ratio of 2.90 across three benchmark signals, indicating efficient signal acquisition and effective debugging and analysis. This method allows users to acquire more data samples than conventional methods. The proposed lab allows students to remotely test and debug their designs, bridging the gap between theory and practice in embedded system design.
Detecting and resolving feature envy through automated machine learning and m...IJECEIAES
ย
Efficiently identifying and resolving code smells enhances software project quality. This paper presents a novel solution, utilizing automated machine learning (AutoML) techniques, to detect code smells and apply move method refactoring. By evaluating code metrics before and after refactoring, we assessed its impact on coupling, complexity, and cohesion. Key contributions of this research include a unique dataset for code smell classification and the development of models using AutoGluon for optimal performance. Furthermore, the study identifies the top 20 influential features in classifying feature envy, a well-known code smell, stemming from excessive reliance on external classes. We also explored how move method refactoring addresses feature envy, revealing reduced coupling and complexity, and improved cohesion, ultimately enhancing code quality. In summary, this research offers an empirical, data-driven approach, integrating AutoML and move method refactoring to optimize software project quality. Insights gained shed light on the benefits of refactoring on code quality and the significance of specific features in detecting feature envy. Future research can expand to explore additional refactoring techniques and a broader range of code metrics, advancing software engineering practices and standards.
Smart monitoring technique for solar cell systems using internet of things ba...IJECEIAES
ย
Rapidly and remotely monitoring and receiving the solar cell systems status parameters, solar irradiance, temperature, and humidity, are critical issues in enhancement their efficiency. Hence, in the present article an improved smart prototype of internet of things (IoT) technique based on embedded system through NodeMCU ESP8266 (ESP-12E) was carried out experimentally. Three different regions at Egypt; Luxor, Cairo, and El-Beheira cities were chosen to study their solar irradiance profile, temperature, and humidity by the proposed IoT system. The monitoring data of solar irradiance, temperature, and humidity were live visualized directly by Ubidots through hypertext transfer protocol (HTTP) protocol. The measured solar power radiation in Luxor, Cairo, and El-Beheira ranged between 216-1000, 245-958, and 187-692 W/m 2 respectively during the solar day. The accuracy and rapidity of obtaining monitoring results using the proposed IoT system made it a strong candidate for application in monitoring solar cell systems. On the other hand, the obtained solar power radiation results of the three considered regions strongly candidate Luxor and Cairo as suitable places to build up a solar cells system station rather than El-Beheira.
An efficient security framework for intrusion detection and prevention in int...IJECEIAES
ย
Over the past few years, the internet of things (IoT) has advanced to connect billions of smart devices to improve quality of life. However, anomalies or malicious intrusions pose several security loopholes, leading to performance degradation and threat to data security in IoT operations. Thereby, IoT security systems must keep an eye on and restrict unwanted events from occurring in the IoT network. Recently, various technical solutions based on machine learning (ML) models have been derived towards identifying and restricting unwanted events in IoT. However, most ML-based approaches are prone to miss-classification due to inappropriate feature selection. Additionally, most ML approaches applied to intrusion detection and prevention consider supervised learning, which requires a large amount of labeled data to be trained. Consequently, such complex datasets are impossible to source in a large network like IoT. To address this problem, this proposed study introduces an efficient learning mechanism to strengthen the IoT security aspects. The proposed algorithm incorporates supervised and unsupervised approaches to improve the learning models for intrusion detection and mitigation. Compared with the related works, the experimental outcome shows that the model performs well in a benchmark dataset. It accomplishes an improved detection accuracy of approximately 99.21%.
Developing a smart system for infant incubators using the internet of things ...IJECEIAES
ย
This research is developing an incubator system that integrates the internet of things and artificial intelligence to improve care for premature babies. The system workflow starts with sensors that collect data from the incubator. Then, the data is sent in real-time to the internet of things (IoT) broker eclipse mosquito using the message queue telemetry transport (MQTT) protocol version 5.0. After that, the data is stored in a database for analysis using the long short-term memory network (LSTM) method and displayed in a web application using an application programming interface (API) service. Furthermore, the experimental results produce as many as 2,880 rows of data stored in the database. The correlation coefficient between the target attribute and other attributes ranges from 0.23 to 0.48. Next, several experiments were conducted to evaluate the model-predicted value on the test data. The best results are obtained using a two-layer LSTM configuration model, each with 60 neurons and a lookback setting 6. This model produces an R 2 value of 0.934, with a root mean square error (RMSE) value of 0.015 and a mean absolute error (MAE) of 0.008. In addition, the R 2 value was also evaluated for each attribute used as input, with a result of values between 0.590 and 0.845.
A review on internet of things-based stingless bee's honey production with im...IJECEIAES
ย
Honey is produced exclusively by honeybees and stingless bees which both are well adapted to tropical and subtropical regions such as Malaysia. Stingless bees are known for producing small amounts of honey and are known for having a unique flavor profile. Problem identified that many stingless bees collapsed due to weather, temperature and environment. It is critical to understand the relationship between the production of stingless bee honey and environmental conditions to improve honey production. Thus, this paper presents a review on stingless bee's honey production and prediction modeling. About 54 previous research has been analyzed and compared in identifying the research gaps. A framework on modeling the prediction of stingless bee honey is derived. The result presents the comparison and analysis on the internet of things (IoT) monitoring systems, honey production estimation, convolution neural networks (CNNs), and automatic identification methods on bee species. It is identified based on image detection method the top best three efficiency presents CNN is at 98.67%, densely connected convolutional networks with YOLO v3 is 97.7%, and DenseNet201 convolutional networks 99.81%. This study is significant to assist the researcher in developing a model for predicting stingless honey produced by bee's output, which is important for a stable economy and food security.
A trust based secure access control using authentication mechanism for intero...IJECEIAES
ย
The internet of things (IoT) is a revolutionary innovation in many aspects of our society including interactions, financial activity, and global security such as the military and battlefield internet. Due to the limited energy and processing capacity of network devices, security, energy consumption, compatibility, and device heterogeneity are the long-term IoT problems. As a result, energy and security are critical for data transmission across edge and IoT networks. Existing IoT interoperability techniques need more computation time, have unreliable authentication mechanisms that break easily, lose data easily, and have low confidentiality. In this paper, a key agreement protocol-based authentication mechanism for IoT devices is offered as a solution to this issue. This system makes use of information exchange, which must be secured to prevent access by unauthorized users. Using a compact contiki/cooja simulator, the performance and design of the suggested framework are validated. The simulation findings are evaluated based on detection of malicious nodes after 60 minutes of simulation. The suggested trust method, which is based on privacy access control, reduced packet loss ratio to 0.32%, consumed 0.39% power, and had the greatest average residual energy of 0.99 mJoules at 10 nodes.
Fuzzy linear programming with the intuitionistic polygonal fuzzy numbersIJECEIAES
ย
In real world applications, data are subject to ambiguity due to several factors; fuzzy sets and fuzzy numbers propose a great tool to model such ambiguity. In case of hesitation, the complement of a membership value in fuzzy numbers can be different from the non-membership value, in which case we can model using intuitionistic fuzzy numbers as they provide flexibility by defining both a membership and a non-membership functions. In this article, we consider the intuitionistic fuzzy linear programming problem with intuitionistic polygonal fuzzy numbers, which is a generalization of the previous polygonal fuzzy numbers found in the literature. We present a modification of the simplex method that can be used to solve any general intuitionistic fuzzy linear programming problem after approximating the problem by an intuitionistic polygonal fuzzy number with n edges. This method is given in a simple tableau formulation, and then applied on numerical examples for clarity.
The performance of artificial intelligence in prostate magnetic resonance im...IJECEIAES
ย
Prostate cancer is the predominant form of cancer observed in men worldwide. The application of magnetic resonance imaging (MRI) as a guidance tool for conducting biopsies has been established as a reliable and well-established approach in the diagnosis of prostate cancer. The diagnostic performance of MRI-guided prostate cancer diagnosis exhibits significant heterogeneity due to the intricate and multi-step nature of the diagnostic pathway. The development of artificial intelligence (AI) models, specifically through the utilization of machine learning techniques such as deep learning, is assuming an increasingly significant role in the field of radiology. In the realm of prostate MRI, a considerable body of literature has been dedicated to the development of various AI algorithms. These algorithms have been specifically designed for tasks such as prostate segmentation, lesion identification, and classification. The overarching objective of these endeavors is to enhance diagnostic performance and foster greater agreement among different observers within MRI scans for the prostate. This review article aims to provide a concise overview of the application of AI in the field of radiology, with a specific focus on its utilization in prostate MRI.
Seizure stage detection of epileptic seizure using convolutional neural networksIJECEIAES
ย
According to the World Health Organization (WHO), seventy million individuals worldwide suffer from epilepsy, a neurological disorder. While electroencephalography (EEG) is crucial for diagnosing epilepsy and monitoring the brain activity of epilepsy patients, it requires a specialist to examine all EEG recordings to find epileptic behavior. This procedure needs an experienced doctor, and a precise epilepsy diagnosis is crucial for appropriate treatment. To identify epileptic seizures, this study employed a convolutional neural network (CNN) based on raw scalp EEG signals to discriminate between preictal, ictal, postictal, and interictal segments. The possibility of these characteristics is explored by examining how well timedomain signals work in the detection of epileptic signals using intracranial Freiburg Hospital (FH), scalp Children's Hospital Boston-Massachusetts Institute of Technology (CHB-MIT) databases, and Temple University Hospital (TUH) EEG. To test the viability of this approach, two types of experiments were carried out. Firstly, binary class classification (preictal, ictal, postictal each versus interictal) and four-class classification (interictal versus preictal versus ictal versus postictal). The average accuracy for stage detection using CHB-MIT database was 84.4%, while the Freiburg database's time-domain signals had an accuracy of 79.7% and the highest accuracy of 94.02% for classification in the TUH EEG database when comparing interictal stage to preictal stage.
Analysis of driving style using self-organizing maps to analyze driver behaviorIJECEIAES
ย
Modern life is strongly associated with the use of cars, but the increase in acceleration speeds and their maneuverability leads to a dangerous driving style for some drivers. In these conditions, the development of a method that allows you to track the behavior of the driver is relevant. The article provides an overview of existing methods and models for assessing the functioning of motor vehicles and driver behavior. Based on this, a combined algorithm for recognizing driving style is proposed. To do this, a set of input data was formed, including 20 descriptive features: About the environment, the driver's behavior and the characteristics of the functioning of the car, collected using OBD II. The generated data set is sent to the Kohonen network, where clustering is performed according to driving style and degree of danger. Getting the driving characteristics into a particular cluster allows you to switch to the private indicators of an individual driver and considering individual driving characteristics. The application of the method allows you to identify potentially dangerous driving styles that can prevent accidents.
Hyperspectral object classification using hybrid spectral-spatial fusion and ...IJECEIAES
ย
Because of its spectral-spatial and temporal resolution of greater areas, hyperspectral imaging (HSI) has found widespread application in the field of object classification. The HSI is typically used to accurately determine an object's physical characteristics as well as to locate related objects with appropriate spectral fingerprints. As a result, the HSI has been extensively applied to object identification in several fields, including surveillance, agricultural monitoring, environmental research, and precision agriculture. However, because of their enormous size, objects require a lot of time to classify; for this reason, both spectral and spatial feature fusion have been completed. The existing classification strategy leads to increased misclassification, and the feature fusion method is unable to preserve semantic object inherent features; This study addresses the research difficulties by introducing a hybrid spectral-spatial fusion (HSSF) technique to minimize feature size while maintaining object intrinsic qualities; Lastly, a soft-margins kernel is proposed for multi-layer deep support vector machine (MLDSVM) to reduce misclassification. The standard Indian pines dataset is used for the experiment, and the outcome demonstrates that the HSSF-MLDSVM model performs substantially better in terms of accuracy and Kappa coefficient.
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSEDuvanRamosGarzon1
ย
AIRCRAFT GENERAL
The Single Aisle is the most advanced family aircraft in service today, with ๏ฌy-by-wire ๏ฌight controls.
The A318, A319, A320 and A321 are twin-engine subsonic medium range aircraft.
The family offers a choice of engines
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
ย
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
Final project report on grocery store management system..pdfKamal Acharya
ย
In todayโs fast-changing business environment, itโs extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
โข Remote control: Parallel or serial interface.
โข Compatible with MAFI CCR system.
โข Compatible with IDM8000 CCR.
โข Compatible with Backplane mount serial communication.
โข Compatible with commercial and Defence aviation CCR system.
โข Remote control system for accessing CCR and allied system over serial or TCP.
โข Indigenized local Support/presence in India.
โข Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
โข Remote control: Parallel or serial interface
โข Compatible with MAFI CCR system
โข Copatiable with IDM8000 CCR
โข Compatible with Backplane mount serial communication.
โข Compatible with commercial and Defence aviation CCR system.
โข Remote control system for accessing CCR and allied system over serial or TCP.
โข Indigenized local Support/presence in India.
Application
โข Remote control: Parallel or serial interface.
โข Compatible with MAFI CCR system.
โข Compatible with IDM8000 CCR.
โข Compatible with Backplane mount serial communication.
โข Compatible with commercial and Defence aviation CCR system.
โข Remote control system for accessing CCR and allied system over serial or TCP.
โข Indigenized local Support/presence in India.
โข Easy in configuration using DIP switches.
Cosmetic shop management system project report.pdfKamal Acharya
ย
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
Courier management system project report.pdfKamal Acharya
ย
It is now-a-days very important for the people to send or receive articles like imported furniture, electronic items, gifts, business goods and the like. People depend vastly on different transport systems which mostly use the manual way of receiving and delivering the articles. There is no way to track the articles till they are received and there is no way to let the customer know what happened in transit, once he booked some articles. In such a situation, we need a system which completely computerizes the cargo activities including time to time tracking of the articles sent. This need is fulfilled by Courier Management System software which is online software for the cargo management people that enables them to receive the goods from a source and send them to a required destination and track their status from time to time.
Nuclear Power Economics and Structuring 2024Massimo Talia
ย
Title: Nuclear Power Economics and Structuring - 2024 Edition
Produced by: World Nuclear Association Published: April 2024
Report No. 2024/001
ยฉ 2024 World Nuclear Association.
Registered in England and Wales, company number 01215741
This report reflects the views
of industry experts but does not
necessarily represent those
of World Nuclear Associationโs
individual member organizations.
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams from the hydrologistโs survey of the valley before construction, all aspects and involved disciplines, fluid dynamics, structural engineering, generation and mains frequency regulation to the very transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfKamal Acharya
ย
The College Bus Management system is completely developed by Visual Basic .NET Version. The application is connect with most secured database language MS SQL Server. The application is develop by using best combination of front-end and back-end languages. The application is totally design like flat user interface. This flat user interface is more attractive user interface in 2017. The application is gives more important to the system functionality. The application is to manage the studentโs details, driverโs details, bus details, bus route details, bus fees details and more. The application has only one unit for admin. The admin can manage the entire application. The admin can login into the application by using username and password of the admin. The application is develop for big and small colleges. It is more user friendly for non-computer person. Even they can easily learn how to manage the application within hours. The application is more secure by the admin. The system will give an effective output for the VB.Net and SQL Server given as input to the system. The compiled java program given as input to the system, after scanning the program will generate different reports. The application generates the report for users. The admin can view and download the report of the data. The application deliver the excel format reports. Because, excel formatted reports is very easy to understand the income and expense of the college bus. This application is mainly develop for windows operating system users. In 2017, 73% of people enterprises are using windows operating system. So the application will easily install for all the windows operating system users. The application-developed size is very low. The application consumes very low space in disk. Therefore, the user can allocate very minimum local disk space for this application.
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
ย
Hybrid filtering methods for feature selection in high-dimensional cancer data
1. International Journal of Electrical and Computer Engineering (IJECE)
Vol. 13, No. 6, December 2023, pp. 6862~6871
ISSN: 2088-8708, DOI: 10.11591/ijece.v13i6.pp6862-6871 ๏ฒ 6862
Journal homepage: http://ijece.iaescore.com
Hybrid filtering methods for feature selection in high-
dimensional cancer data
Siti Sarah Md Noh1
, Nurain Ibrahim1,3
, Mahayaudin M. Mansor1
, Marina Yusoff2,3
1
School of Mathematical Sciences, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA,
Shah Alam, Malaysia
2
School of Computing Sciences, College of Computing, Informatics and Mathematics, Universiti Teknologi MARA,
Shah Alam, Malaysia
3
Institute for Big Data Analytics and Artificial Intelligence, Kompleks Al-Khawarizmi, Universiti Teknologi MARA,
Shah Alam, Malaysia
Article Info ABSTRACT
Article history:
Received Jan 31, 2023
Revised Apr 8, 2023
Accepted Apr 14, 2023
Statisticians in both academia and industry have encountered problems with
high-dimensional data. The rapid feature increase has caused the feature count
to outstrip the instance count. There are several established methods when
selecting features from massive amounts of breast cancer data. Even so,
overfitting continues to be a problem. The challenge of choosing important
features with minimum loss in a different sample size is another area with room
for development. As a result, the feature selection technique is crucial for
dealing with high-dimensional data classification issues. This paper proposed
a new architecture for high-dimensional breast cancer data using filtering
techniques and a logistic regression model. Essential features are filtered out
using a combination of hybrid chiโsquare and hybrid information gain (hybrid
IG) with logistic regression as classifier. The results showed that hybrid IG
performed the best for high-dimensional breast and prostate cancer data. The
top 50 and 22 features outperformed the other configurations, with the highest
classification accuracies of 86.96% and 82.61%, respectively, after integrating
the hybrid information gain and logistic function (hybrid IG+LR) with a sample
size of 75. In the future, multiclass classification of multidimensional medical data
to be evaluated using data from a different domain.
Keywords:
Breast cancer
Classification
Filtering method
High-dimensional data
Logistic regression
Prostate cancer
This is an open access article under the CC BY-SA license.
Corresponding Author:
Nurain Ibrahim
School of Mathematical Sciences, College of Computing, Informatics and Mathematics, Universiti Teknologi
MARA
Shah Alam, Malaysia
Email: nurainibrahim@uitm.edu.my
1. INTRODUCTION
In the medical field of breast and prostate cancer, high-dimensional data is defined as when the
number of variables or features exceeds the number of observations. Statistical scientists in academia and
industry encounter this evidence daily. Most researchers have always worked with data that has many features.
However, advances in data storage and computing capacity have resulted in the development of high-dimensional
data in various fields, such as genetics, signal processing, and finance. The accuracy of classification algorithms
tends to decline in high dimensions data due to a phenomenon known as the curse of dimensionality. When the
dimensionality increases, the volume of space expands rapidly, and the accessible data becomes sparse.
Furthermore, recognizing areas where objects form groups with similar attributes is widely used to
organize and search data. However, due to the high dimensionality of the data, it became sparse, resulting in
2. Int J Elec & Comp Eng ISSN: 2088-8708 ๏ฒ
Hybrid filtering methods for feature selection in high-dimensional cancer data (Siti Sarah Md Noh)
6863
increased errors. It is challenging to design an algorithm for dealing with high-dimensional data. The ability of
an algorithm to give a precise result and converge to an accurate model decreases as dimensionality increases.
The next challenge in modeling high-dimensional data is to avoid overfitting. It is critical to develop
a classification model that is capable of generalization. The classification model must perform admirably in
both training and testing data. Nonetheless, the small number of samples on high-dimensional data can cause
the classification model to be overfitted to the training data, resulting in poor model generalization ability. To
avoid the abovementioned issues, feature selection must be applied to high-dimensional data beforehand to
select only the significant features. The problem is determining the most efficient method to determine the
relevant elements with less loss in different sample sizes. Several studies on high-dimensional classification
reporting methods have been published in the literature. Liang et al. [1] proposed conditional mutual
information-based feature selection with interaction to reduce performance error [2]. Tally et al. [3] discovered
the genetic algorithm feature selection with a support vector machine classifier for intrusion detection, while
Sagban et al. [2] investigated the performance of feature selection applied to cervical cancer data. Ibrahim and
Kamarudin [4] applied filter feature selection method to improve heart failure data classification.
Guo et al. [5] proposed weighting and ranking-based hybrid feature selection to select essential risk
factors for ischemic stroke detection, and Cekik et al. [6] developed a proportional rough feature selector to
classify short text. Too many researchers focus on fusing feature selection, leaving traditional techniques like
filter, wrapper, and embedded unstudied. A traditional feature selection method contains no hybrid or novel
features. Way et al. [7] also investigated how small sample size affects feature selection and classification
accuracy. More research is needed to understand how small sample sizes affect high-dimensional data.
Wah et al. [8] compared information gain and correlation-based feature selection, wrapper sequential forward
and sequential backward elimination to maximize classifier accuracy but still need to include the embedded
technique. As a result, this study proposes the best integration method for evaluating high-dimensional data
classification performance. As a result, breast cancer classification would improve with key features. As a
result, much of this research topic requires further investigation.
The rest of the paper is organized as follows: section 2 explains material and method. Section 3
presents the results and the discussion of the experiment. Section 4 constitutes the conclusion.
2. MATERIAL AND METHOD
This section describes the structure, method, and procedure of the study. The methodology for this
study includes data collection, preprocessing, feature extraction, and modeling. Patients with T1T2N0 breast
carcinoma were studied at the Marie Curie Institute in Paris [9]. Data on prostate cancer is also being used, and
the data were first analyzed by Singh et al. [10]. To remove unnecessary information, the data is preprocessed.
The training set will be filtered using hybrid information gain (hybrid IG) and hybrid chiโsquare to select
essential features. After that, the data is used to generate training and testing sets. Figure 1 depicts the
conceptual research methodology for the study. To begin, two high-dimensional data sets, breast and prostate
cancer, will be entered into the R software. The data will be preprocessed, which includes detecting, removing,
or replacing missing values with appropriate ones and checking for redundant ones. Hence, 100 samples at
random and 75 sample sizes before dividing the data into 70% and 70%, where 70% for training and 30% for
testing [11], [12]. Training data samples were used as input for the classifier to learn the features and build a
new learning model [13].
Meanwhile, the learning model was used during the testing phase to predict the test data. The training
data will be filtered using two filter selection methods after the data has been split: information gain and
chi-square. During this process, the important features were identified, and several features began to be reduced
based on their ranking. The ranking was sorted based on the weight of the individual features, with a higher
weight indicating a status of an importance feature. Following that, all of the identified essential features used
in this study have ๐ =50, 22, and 10, where ๐ is the dimension of reduced features, and will be fed into the
logistic regression data mining algorithm, from which a new model was built. The performance of each data
was then predicted and evaluated using testing data. Finally, the testing data was used to make a prediction and
the classifier's performance was evaluated using accuracy, sensitivity, specificity, and precision.
2.1. Data acquisition
The breast cancer data was first explored [9]. It was collected from a patient at Institute Curie for ten
years from 1989 until 1999 or pT1T2N0 breast carcinoma. The data has clinicopathological characteristics of
the tumor and the gene expression that had two classes where patients with no event after diagnosis were
labeled good and patients with early metastatic will be labeled poor. The data consist of 2,905 genes with only
168 samples. Recent studies show that many researchers used breast cancer data to tackle the problem of high-
dimensional data [14], [15]. The second data will be used in prostate cancer, which was initially analyzed by
[10]. It was gathered from 1995 to 1997 from Brigham and Women's Hospital patients who were having radical
3. ๏ฒ ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 6862-6871
6864
prostatectomy surgery. The data contains 102 patterns of gene expression, of which 50 are from normal prostate
specimens, and 52 are from tumors. The data, a collection of gene expression data based on oligonucleotide
microarray, contains roughly 12,600 genes. Past studies show that there are a lot of past researchers that used
prostate cancer data in investigating the classification of high-dimensional data [16]โ[19]. The summary of
these high-dimensional cancer data is shown in Table 1.
Figure 1. Conceptual research methodology
Table 1. Summary of high-dimensional cancer data
Data No. of features No. of samples No. of classes Reference
Breast cancer 2905 168 2 [9]
Prostate cancer 12600 102 2 [10]
2.2. Data preparation
Data preparation consists of preprocessing, identifying sample size, and selecting features. Pre-
processing entails cleaning the data by detecting, removing, or replacing missing values with appropriate ones
and examining redundant ones. The sample sizes are determined based on the previous study [20]. This study
uses 75, 100, and the maximum sample size (full sample size of data set) as sample size configurations. The
top 50 important, top 22 important, and top 10 important features are listed. The features identified are based
on industry standards [21].
2.3. Filtering steps
For the feature selection process, filtering methods are chosen. Two filtering methods were employed
to obtain three subsets of essential features. The filter selection method is a feature ranking technique that
assesses the features independently, based on the data characteristic, without involving any learning algorithm
or classifier. Each variable is scored using a suitable ranking criterion by assigning weight for each feature.
The weight under the threshold value would be deemed unimportant and removed [22]. Then, all the reduced
features would be input into the learning algorithm to assess the performance of the measurement.
2.4. Information gain
Based on the literature review, information gain was one of the most widely used univariate filters in
evaluating the attributes [8], [23], [24]. This filter method analyzes a single feature at a time based on the
4. Int J Elec & Comp Eng ISSN: 2088-8708 ๏ฒ
Hybrid filtering methods for feature selection in high-dimensional cancer data (Siti Sarah Md Noh)
6865
information gained. It uses entropy measures to rank the variables. It is calculated by calculating the changes
in entropy from a previous state to the known value state. The entropy for class features ๐ is presented as (1).
๐ป(๐) = โ ๐(๐ฆ)๐๐๐2(๐(๐ฆ))
๐
๐=1 (1)
The marginal probability density function for the random variable ๐ is denoted by ๐(๐ฆ). The marginal
probability density function for the random variable ๐ is denoted by ๐(๐ฆ). There was a link between feature ๐
and ๐. It occurs in the training data, where the observed value for ๐ฆ was partitioned based on the second feature
๐. The result of partitioning makes the entropy of ๐ฆ produced by ๐ regarding the partition less than that of ๐
before the partitioning. Hence, the entropy of ๐ after observing ๐ is stated in (2). Once both entropy is
computed, the differences are calculated to determine the gain value. The gain values from ๐ and ๐ are the
reduction in entropy values known as information gain. The calculation is as in (3).
๐ป(๐|๐) = โ ๐(๐ฅ)
๐
๐=1 โ ๐(๐ฆ|๐ฅ)๐๐๐2(๐(๐ฆ|๐ฅ))
๐
๐=1 (2)
๐ผ๐บ(๐) = ๐ป(๐) โ ๐ป(๐|๐) (3)
The feature will each be ranked with its own information gain value. Higher information gain value
will hold more information. After obtaining the information gain value, a threshold is needed to select the
important features according to the order accepted. However, the weakness of using information gain is that it
does not remove redundant data and needs to be more balanced with features that have more value, even though
it only holds a little information.
2.5. Chiโsquare
Chi-square is a univariate filter algorithm that uses a test of independence evaluation to measure each
feature's merit using a discretization algorithm [25]. This method assesses each feature individually by
calculating chiโsquare statistics concerning each class [26], [27]. A relevant feature will have a high chiโsquare
value for each class. The equation of measure is shown in (4). Given from (4), where ๐ denotes the number of
intervals, ๐ต is the number of classes, ๐ denotes the total number of instances, ๐ ๐ refers to the number of
instances in the ith
range, ๐ต๐ the number of instances in ๐th
class and finally, ๐ด๐๐ is the number of instances in
the ๐th
range and ๐th
class.
๐2
= โ โ
[๐ด๐๐โ
๐ ๐โ๐ต๐
๐
]
2
๐ ๐โ๐ต๐
๐
๐ต
๐=1
๐
๐=1 , (4)
2.6. Logistic regression model
Logistic regression assigns each independent variable a coefficient that explains the contribution to
variation in the independent variable. If the response is "Yes," the dependent variable will become 1. Otherwise,
it will become 0. The predicted probabilities model is expressed as a natural logarithm (๐๐) of the odds ratio
and the linear logistic model is shown as in (5) [28].
๐ฅ๐๐ = ๐ฝ0 + ๐ฝ1๐ฅ๐1 + ๐ฝ2๐ฅ๐2 โฆ . . +๐ฝ๐๐ฅ๐๐ (5)
๐๐ [
๐ข๐
1โ๐ข๐
] = ๐ฅ๐๐ = ๐ฝ0 + ๐ฝ1๐ฅ๐1 + ๐ฝ2๐ฅ๐2 โฆ . . +๐ฝ๐๐ฅ๐๐ (6)
and
๐ข๐
1โ๐ข๐
= ๐๐ฝ0+๐ฝ1๐ฅ๐1+๐ฝ2๐ฅ๐2โฆ..+๐ฝ๐๐ฅ๐๐ (7)
๐ข๐ = ๐๐ฝ0+๐ฝ1๐ฅ๐1+๐ฝ2๐ฅ๐2โฆ..+๐ฝ๐๐ฅ๐๐ โ ๐ข๐๐๐ฝ0+๐ฝ1๐ฅ๐1+๐ฝ2๐ฅ๐2โฆ..+๐ฝ๐๐ฅ๐๐ (8)
๐ข๐ =
๐๐ฝ0+๐ฝ1๐ฅ๐1+๐ฝ2๐ฅ๐2โฆ..+๐ฝ๐๐ฅ๐๐
1+๐๐ฝ0+๐ฝ1๐ฅ๐1+๐ฝ2๐ฅ๐2โฆ..+๐ฝ๐๐ฅ๐๐
(9)
where ๐๐ [
๐ข๐
1โ๐ข๐
] is the log odds of the outcome, ๐ข๐ is the binary outcome, ๐๐1, ๐๐2, . . ๐๐ is the independent
variable, ๐ฝ๐1, ๐ฝ๐2 โฆ . ๐ฝ๐๐ is the regression coefficient and ๐ฝ0 is the intercept.
5. ๏ฒ ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 6862-6871
6866
2.7. Performance measures
The evaluation of performance on the algorithm used is defined from a matrix with the numbers of
instances correctly and incorrectly classified for each class are called a confusion matrix. It is a table with two
rows and two columns that displays the number of true positives (๐๐), true negatives (๐๐), false negatives
(๐น๐) and false positives (๐น๐). Accuracy was shown as the most used metric in the past study [29], [30]
calculates the percentage of correctly specified predictions which shows the effectiveness of the chosen
algorithm. Equation (10) shows the calculation of accuracy. Furthermore, sensitivity is a test's ability to specify
a positive class called a true positive rate. It is a ratio of true positives to the sum of true positives and false
negatives. Mathematically it can be stated as (11). In addition, this study also used specificity as one of the
performance metrics. Specificity is the test's ability to correctly identify the negative class called the true
negative rate. It is calculated by dividing the true negative by the sum of the true negative and the false positive.
It can be stated as (12). Nevertheless, precision is the last performance metric used in this study. Precision is the
ability of a test to assign the positive event to the positive class. Equation (13) states the calculation for precision.
๐ด๐๐๐ข๐๐๐๐ฆ =
๐๐+๐๐
(๐๐+๐๐+๐น๐+๐น๐)
(10)
๐๐๐๐ ๐๐ก๐๐ฃ๐๐ก๐ฆ =
๐๐
(๐๐+๐น๐)
(11)
๐๐๐๐๐๐๐๐๐๐ก๐ฆ =
๐๐
(๐๐+๐น๐)
(12)
๐๐๐๐๐๐ ๐๐๐ =
๐๐
(๐๐+๐น๐)
(13)
3. RESULTS AND DISCUSSION
The study features several experiments to find a promising binary classification solution for breast
cancer and prostate cancer binary classification solution. The analysis will be based on experiments integrating
filtering methods with logistic regression. The methods are hybrid IG (๐ =22)+LR, hybrid IG (๐ =10)+LR,
hybrid chiโsquare (๐ =22)+LR and hybrid chiโsquare (๐ =10)+LR. The measurement considers full sample
size, 100 sample size, and 75 sample size of breast cancer and prostate cancer data.
3.1. Result for top ๐ = 50 important features
Performance measures on the full sample size, 100 sample size, and 75 sample size applied to the
high-dimensional breast and prostate cancer data are shown in Table 2. The performance of feature selection
methods was assessed using classification accuracy, sensitivity, specificity, and precision. As demonstrated in
Table 2, the accuracy is highest for hybrid chiโsquare (๐ =50)+LR with 72.55% and hybrid IG (๐ =50)+LR
reporting at 66.67% for full sample sizes. The worst accuracy for no feature selection applied with only 56.86%
value can be found. Hybrid chiโsquare (๐ =50)+LR has good performance in sensitivity, where it obtained the
highest value of 84.62%, whereas hybrid IG (๐ =50)+LR and a reasonably good sensitivity value of 76.92%.
No feature selection method again has the worst sensitivity value with only 48.72%. However, it can be seen
when comparing specificity and precision that no feature selection approach outperforms others with 83.33%
and 90.48%, respectively but also obtained the worst accuracy and sensitivity value, which are 56.89% and
48.72%. For a sample size of 100, hybrid IG (๐ =50)+LR obtained the best value for all performance measures
with 66.67% accuracy, 72.22% sensitivity, 58.33% specificity, and 72.22% precision. No feature selection
approach is the worst feature selection method with the lowest accuracy, specificity, and precision of 60.00%,
41.67% and 65.00%, respectively. Furthermore, every feature selection method has the same sensitivity value
of 72.22%.
Meanwhile, for a sample size of 75, it can be observed that hybrid IG (๐ =50)+LR holds the highest
value for accuracy, sensitivity, and precision which are 86.96%, 94.44%, and 89.47% when applied to high-
dimensional breast cancer data. However, when considering the specificity, it can be seen that no feature
selection has the highest value among the rest, 80.00%. In summary, hybrid IG (๐ =50)+LR performs the best
for breast cancer data since it achieves the best performance for all criteria. Even though other methods have
been demonstrated to perform best in some performance metrics, such as no feature selection, which reaches
the highest specificity value when applied to breast cancer data, when considering other performance values,
hybrid IG (๐ =50)+LR still outperforms other methods in most performance metrics. As a result, the optimum
filter selection strategy is hybrid IG (๐ =50)+LR for breast cancer for 75 sample sizes considering the top
๐ =50 important features.
6. Int J Elec & Comp Eng ISSN: 2088-8708 ๏ฒ
Hybrid filtering methods for feature selection in high-dimensional cancer data (Siti Sarah Md Noh)
6867
For prostate cancer data with full sample sizes, hybrid chiโsquare (๐ =50)+LR shows the highest
accuracy, sensitivity, and precision percentages with 70.97%, 66.67%, and 71.43%. Contradict output can be
seen when these feature selection methods were applied to prostate cancer with sample sizes of 100 and 75,
where hybrid chiโsquare (๐ =50)+LR outperformed other methods with 80.00%, 90.00%, 75.00% and 64.29%
of accuracy, sensitivity, specificity, and precision respectively. Hence, hybrid chiโsquare (๐ =50)+LR is the
best method for full sample size, considering only 50 significant features for both high-dimensional breast and
prostate cancer data.
Table 2. Performance measures of each filter method applied to top ๐ =50 features and ๐ = full sample, 100
and 75 for breast cancer data
Data Performance
Measures
๐ = full sample size ๐ =100 ๐ =75
No
Feature
Selection
Hybrid IG
(d=22)
+LR (%)
Hybrid chi-
square
(d=22) + LR
(%)
No
Feature
Selection
Hybrid IG
(d=22) +LR
(%)
Hybrid
chi-square
(d=22) +
LR (%)
No
Feature
Selection
Hybrid
IG
(d=22)
+LR (%)
Hybrid chi-
square
(d=22) +
LR (%)
Breast
Cancer
Accuracy 56.86 66.67 72.55 60.00 66.67 63.33 47.83 86.96 56.52
Sensitivity 48.72 76.92 84.62 72.22 72.22 72.22 38.89 94.44 72.22
Specificity 83.33 33.33 33.33 41.67 58.33 50.00 80.00 60.00 0.00
Precision 90.48 78.95 80.49 65.00 72.22 68.42 87.50 89.47 72.22
Prostate
Cancer
Accuracy 58.06 61.29 70.97 30.00 76.67 80.00 56.52 39.13 60.87
Sensitivity 33.33 53.33 66.67 10.00 80.00 90.00 33.33 66.67 33.33
Specificity 81.25 68.75 75.00 40.00 75.00 75.00 71.43 21.43 78.57
Precision 62.50 61.54 71.43 7.69 61.54 64.29 42.86 35.29 50.00
3.2. Result for Top ๐ = 22 important features
Performance measures of each filter method applied to top ๐ =22 features, and ๐ = ๐๐ข๐๐ sample, 100
and 75 for high-dimensional breast cancer data is demonstrated in Table 3. As can be seen in Table 3, the
hybrid IG (๐ =22)+LR shows the highest accuracy and sensitivity of 68.63% and 79.49%, respectively.
However, in terms of specificity and precision, no feature selection outperforms other methods with
percentages of 83.33%. Thus, after considering only 22 features, filter hybrid IG (๐ =22)+LR is the optimal
feature selection approach for the full sample size of high-dimensional breast cancer with top ๐ =22 features
based on accuracy and sensitivity. Each method's performance metrics were assessed using 100 sample sizes
for high-dimensional breast cancer data. When comparing specificity, no feature selection, hybrid IG (๐ =22)
+LR and hybrid chi-square shared the same performance of 41.67%. The more stable performance of no feature
selection with a high value for each measure goes to achieving 60.00% accuracy, 72.22% sensitivity, and
65.00% precision. In addition, hybrid chiโsquare looks to perform the worst, scoring 46.67% in accuracy and
56.25% in precision.
Table 3. Performance measures of each filter method applied to top ๐ = 22 features and ๐ = full sample, 100
and 75 for high-dimensional breast and prostate cancer data
Data Performance
Measures
๐ = full sample size ๐ = 100 ๐ = 75
No
Feature
Selection
Hybrid
IG
(d=22)
+LR (%)
Hybrid
chi-
square
(d=22) +
LR (%)
No
Feature
Selection
Hybrid
IG
(d=22)
+LR
(%)
Hybrid
chi-square
(d=22) +
LR (%)
No
Feature
Selection
Hybrid
IG
(d=22)
+LR
(%)
Hybrid
chi-
square
(d=22) +
LR (%)
Breast
Cancer
Accuracy 56.86 68.63 56.86 60.00 50.00 46.67 47.83 82.61 73.91
Sensitivity 48.72 79.49 66.67 72.22 55.56 50.00 38.89 94.44 77.78
Specificity 83.33 33.33 25.00 41.67 41.67 41.67 80.00 40.00 60.00
Precision 90.48 79.49 74.29 65.00 58.82 56.25 87.50 85.00 87.50
Prostate
Cancer
Accuracy 58.06 74.19 90.32 30.00 96.67 90.00 56.52 86.96 91.30
Sensitivity 33.33 60.00 93.33 10.00 100.0 100.0 33.33 88.89 88.89
Specificity 81.25 87.50 87.50 40.00 95.00 85.00 71.43 85.71 92.86
Precision 62.50 81.82 87.50 7.69 90.91 76.92 42.86 80.00 88.89
For a sample size of 75, hybrid IG (๐ =22)+LR outperforms others in accuracy and sensitivity, with
82.61% accuracy and 94.44% sensitivity. Hence, it can be concluded that hybrid IG (๐ =22)+LR is the best
feature selection method for high-dimensional breast cancer data. In addition, no feature selection yielded the
lowest performance measure for accuracy, with 47.83% and 38.89% for sensitivity, making it the worst possible
method to be applied. It can be said that filter hybrid IG (๐ =22)+LR is the best feature selection method for high-
dimensional breast cancer. However, if the study wants to proceed with filtering techniques, it is better to use
7. ๏ฒ ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 6862-6871
6868
hybrid IG (๐ =22)+LR. It is the best feature selection method, with 22 features and 75 sample sizes for high-
dimensional breast cancer respectively. These feature selection procedures were also applied to prostate cancer,
and Table 3 displayed that hybrid ChiโSquare (๐ =22)+LR is the best procedure for top ๐ =22 and ๐ = full
sample and ๐ =75 as it obtained the highest accuracies, sensitivity, specificity, and specificity precision.
3.3. Result for top ๐ = ๐๐ important features
The performance measures of each filter method applied to top ๐ =10 features and ๐ = full sample,
100 and 75 for high-dimensional breast cancer data are illustrated in Table 4. The performance metrics for
high-dimensional breast cancer data using ten significant features by filtering technique (hybrid IG (๐ =10) +
LR and hybrid chiโsquare (๐ =10)+LR), for full sample size and then compared to no feature selection. In
high-dimensional breast data, no feature selection has the highest performance value for specificity and
precision with 83.33% and 90.48%, respectively but performs poorly for the other two criteria having the
lowest values for accuracy, 56.86%, and sensitivity, 48.72%. Hence, it can be concluded that no feature
selection performs the worst as it gives out a very imbalanced output. Hybrid IG (๐ =10)+LR had the best
performance as it gave out stable and high measurement values of 83.87%, 80.00%, 87.50%, and 85.71% for
accuracy, sensitivity, specificity, and precision, respectively.
For a sample size of 100, as shown in Table 4, performance metrics were applied to two data, high-
dimensional breast cancer data, where two different feature selection was used, which are filter (hybrid IG
(๐ =10)+LR and hybrid chiโsquare (๐ =10)+LR) by 100 sample size and 10 important features. The results
were compared with no feature selection. Hybrid IG (๐ =10)+LR and hybrid chiโsquare (๐ =10)+LR also
gave a good result, with both having the same values for all metrics, which are 66.67% accuracy, 88.89%
sensitivity, 33.33% specificity, and 66.67 precision.
For a sample size of 75, Table 4 illustrates the performance measure for high-dimensional breast
cancer data when applying hybrid IG (๐ =10)+LR and hybrid chiโsquare (๐ =10) + LR, taking only 10
important features for 75. The high-dimensional breast cancer data shows that no feature selection performs
poorly by achieving the lowest performance value for two out of four metrics, with 47.83% for accuracy and
38.89% for sensitivity, in terms of specificity and precision. Hence, it can be concluded that no feature selection
performs the worst for high-dimensional breast cancer data. The method that can be seen giving out a high and
consistent value in terms of accuracy, and sensitivity, is hybrid IG (๐ =10)+LR, with 69.57% and 77.78%.
hybrid IG (๐ =10)+LR is the best feature selection method when applied to high-dimensional prostate cancer
data with the configuration of ๐ = full sample, ๐ =100, and ๐ =75 with ๐ =10 features. Thus, it can be
concluded that hybrid IG (๐ =10)+LR is the best feature selection method for high-dimensional breast cancer
data. It can be said that the filter feature selection method works well for both high-dimensional breast cancer
and prostate cancer data.
Table 4. Performance measures of each filter method applied to top ๐=10 features and ๐=full sample, 100
and 75 for breast and prostate cancer data
Data Performance
Measures
๐ = full sample size ๐ = 100 ๐ = 75
No
Feature
Selection
Hybrid
IG,
(d=22)
+LR
(%)
Hybrid
chi-
square
(d=22) +
LR (%)
No
Feature
Selection
Hybrid
IG
(d=22)
+LR
(%)
Hybrid
chi-
square
(d=22)
+ LR (%)
No
Feature
Selection
Hybrid
IG
(d=22)
+LR
(%)
Hybrid
chi-
square
(d=22)
+ LR(%)
Breast
Cancer
Accuracy 56.86 82.35 80.39 60.00 66.67 66.67 47.83 69.57 65.22
Sensitivity 48.72 89.74 87.18 72.22 88.89 88.89 38.89 77.78 61.11
Specificity 83.33 58.33 58.33 41.67 33.33 33.33 80.00 40.00 80.00
Precision 90.48 87.50 87.18 65.00 66.67 66.67 87.50 82.35 91.67
Prostate
Cancer
Accuracy 58.06 83.87 77.42 30.00 96.67 90.00 56.52 95.65 91.30
Sensitivity 33.33 80.00 80.00 10.00 100.00 100.00 33.33 88.89 100.00
Specificity 81.25 87.50 75.00 40.00 95.00 85.00 71.43 100.00 85.71
Precision 62.50 85.71 75.00 7.69 90.91 76.92 42.86 100.00 81.82
3.4. Discussion
To classify the model's output, several performance evaluations were used. Metrics such as the gap
between the two sets, the accuracy of both sets, and the precision, recall, and F1-score value reveal differences
between the training and testing sets. The gap between the accuracy, training, and validation measures narrows
to a reasonable level during model training. All models can classify patients. However, this study finds that the
hybrid information gain with logistic regression (hybrid IG+LR) provides the best results for ๐ =100 and
๐ =75 with top ๐ =50, n=full sample and ๐ =75 with top ๐ =22 after training and testing data were applied
8. Int J Elec & Comp Eng ISSN: 2088-8708 ๏ฒ
Hybrid filtering methods for feature selection in high-dimensional cancer data (Siti Sarah Md Noh)
6869
on the breast cancer data. Suprisingly, hybrid IG+LR is the best method for all sample sizes with top ๐ =10.
Furthermore, hybrid IG+LR is still outperformed other methods when it was applied to high-dimensional
prostate cancer data. Specifically, the analysis involved ๐ =100 with top ๐ =22 and ๐ =100 and ๐ =75 with
top ๐ =10.
It is interesting to note that hybrid IG+LR performs the best for high-dimensional breast and prostate
cancer data since it achieves the best performance for all criteria. Even though other methods have been
demonstrated to perform at par within some performance metrics, no feature selection achieved the highest
specificity value when applied to high-dimensional breast cancer data. As a result, the optimum filter selection
strategy is hybrid IG+LR for high-dimensional breast cancer data for 75 sample sizes considering top ๐ =50
and ๐ =22 important features. In addition, filter hybrid chiโsquare+LR gives a feasible feature selection
solution for high-dimensional breast and prostate cancer data. Table 5 shows the top ๐ =22 and ๐ก๐๐ =10
important features of each filter method applied ๐ = full sample for breast and prostate cancer data. As shown
in Table 5, several the same features appeared in the top ๐ =22 and top ๐ =10 for hybrid IG+LR and hybrid
chi-square+LR when these methods were applied to high-dimensional breast cancer data. However, different
features were selected by hybrid chi-square+LR when this method was used for high-dimensional prostate
cancer data with top ๐ =22 and top ๐ =10 important features and full sample sizes.
The top 50 and 22 features outperformed the other configurations, with the highest classification
accuracies of 86.96% and 82.61%, respectively, after integrating the hybrid information gain and logistic
function (hybrid IG+LR) with a sample size of 75. In conclusion, this study shows that reducing in sample size
resulted in increased classification accuracy. This finding was supported by Eckstein et al. [31] and
Arbabshirani et al. [32] who also gave out the same result. Hence, it can be assumed that sample size does
influence the classification accuracy. This study revealed that sample sizes influenced the hybrid IG+LR and
performance and hybrid chiโsquare+LR. So, deciding the best feature selection methods to be applied to high-
dimensional data is still challenging. However, this study showed that the recommended feature selection
method is hybrid IG+LR for high-dimensional cancer data.
Table 5. Top ๐=22 and top ๐=10 essential features of each filter method applied ๐=full sample for breast and
prostate cancer data
Data Hybrid methods ๐ =full sample size
Breast Cancer
(n=168)
Hybrid IG(d=22)
+LR
x.g1CNS507, x.g1int1354, x.g7E05, x.g1int429, x.g1int372, x.g1int1131, x.g1int1662,
x.g1int1702, x.g1int382, x.g1CNS28, x.g1int1130, x.g1int154, x.g1int659, x.g1int373,
x.g1CNS229, x.g1int361, x.g2B01, x.g1int663, x.g1int895, x.g1int1414, x.g1int1220,
x.g1int380
Hybrid IG(d=10)
+LR
x.g1CNS507, x.g1int1354, x.g7E05, x.g1int429, x.g1int372, x.g1int1131, x.g1int1662,
x.g1int1702, x.g1int382, x.g1CNS28
Hybrid chi-square
(d=22)+LR
x.g1CNS507, x.g7E05, x.g1int1354, x.g1int372, x.g1int1131, x.g1int382, x.g1int1702,
x.g1int1662, x.g1CNS28, x.g1int1130, x.g1int659, x.g1int429, x.g1int1220, x.g1int373,
x.g1CNS229, x.g1int380, x.g1int369, x.g1int361, x.g2B01, x.g1int663, x.g3F01, x.g1int375
Hybrid chi-square
(d=10)+LR
x.g1CNS507, x.g7E05, x.g1int1354, x.g1int372, x.g1int1131, x.g1int382, x.g1int1702,
x.g1int1662, x.g1CNS28, x.g1int1130
Prostate
Cancer
(n=102)
Hybrid IG(d=22)
+LR
x.V7247, x.V10494, x.V6866, x.V6462, x.V9850, x.V8850, x.V4365, x.V5757, x.V6185,
x.V9172, x.V6620, x.V8566, x.V9034, x.V4241, x.V205, x.V5566, x.V5835, x.V12148,
x.V8058, x.V8724, x.V7557, x.V8965
Hybrid IG(d=10)
+LR
x.V7247, x.V10494, x.V6866, x.V6462, x.V9850, x.V8850, x.V4365, x.V5757, x.V6185,
x.V9172
Hybrid chi-square
(d=22)+LR
x.V10494, x.V9172, x.V10956, x.V6185, x.V4365, x.V11818, x.V9850, x.V6462, x.V8566,
x.V7247, x.V8103, x.V8850, x.V10260, x.V10138, x.V6620, x.V942, x.V3794, x.V9034,
x.V12153, x.V7820, x.V8965, x.V299
Hybrid chi-square
(d=10)+LR
x.V10494, x.V7247, x.V9850, x.V4365, x.V5757, x.V6185, x.V6866, x.V6462, x.V9172,
x.V8566
4. CONCLUSION
This paper attempts to provide more detailed investigations regarding high-dimensional breast cancer
and prostate data. This research compares filter feature selection methods in different sample sizes. Logistic
regression with hybrid IG+LR demonstrates improvement in binary classification accuracy, especially for
small sample sizes. It can be said that the filter feature selection method works well on high-dimensional breast
cancer and prostate cancer data. The result is significant for many features and a small sample size. In addition,
the sample size configuration affects the feature selection and classification performance. It resulted in
integrating hybrid IG+LR with a sample size of 75, with the top 50 and 22 important features outperforming
other configurations. Thus, this integration is expected to be used in other types of high-dimensional data. In
the future, evaluating the multiclass classification from a different domain is recommended.
9. ๏ฒ ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 6862-6871
6870
ACKNOWLEDGEMENTS
The authors would like to acknowledge Institute for Big Data Analytics and Artificial Intelligence
(IBDAAI), Research Nexus UiTM (ReNeU), and specifically thanks to research management centre (RMC),
Universiti Teknologi MARA for financial support under the Malaysian Research Assessment (MyRA) Grant
Scheme 600-RMC/GPM LPHD 5/3 (055/2021).
REFERENCES
[1] J. Liang, L. Hou, Z. Luan, and W. Huang, โFeature selection with conditional mutual information considering feature interaction,โ
Symmetry, vol. 11, no. 7, Jul. 2019, doi: 10.3390/sym11070858.
[2] R. Sagban, H. A. Marhoon, and R. Alubady, โHybrid bat-ant colony optimization algorithm for rule-based feature selection in health
care,โ International Journal of Electrical and Computer Engineering (IJECE), vol. 10, no. 6, pp. 6655โ6663, Dec. 2020, doi:
10.11591/ijece.v10i6.pp6655-6663.
[3] M. T. Tally and H. Amintoosi, โA hybrid method of genetic algorithm and support vector machine for intrusion detection,โ
International Journal of Electrical and Computer Engineering (IJECE), vol. 11, no. 1, pp. 900โ908, Feb. 2021, doi:
10.11591/ijece.v11i1.pp900-908.
[4] N. Ibrahim and A. N. Kamarudin, โAssessing time-dependent performance of a feature selection method using correlation sharing
T-statistics (corT) for heart failure data classification,โ AIP Conference Proceedings, 2023. doi: 10.1063/5.0109918.
[5] Y. Guo, F.-L. Chung, G. Li, and L. Zhang, โMulti-label bioinformatics data classification with ensemble embedded feature
selection,โ IEEE Access, vol. 7, pp. 103863โ103875, 2019, doi: 10.1109/ACCESS.2019.2931035.
[6] R. Cekik and A. K. Uysal, โA novel filter feature selection method using rough set for short text data,โ Expert Systems with
Applications, vol. 160, Dec. 2020, doi: 10.1016/j.eswa.2020.113691.
[7] T. W. Way, B. Sahiner, L. M. Hadjiiski, and H.-P. Chan, โEffect of finite sample size on feature selection and classification: a
simulation study,โ Medical Physics, vol. 37, no. 2, pp. 907โ920, Jan. 2010, doi: 10.1118/1.3284974.
[8] Y. B. Wah, N. Ibrahim, H. A. Hamid, S. A. Rahman, and S. Fong, โFeature selection methods: Case of filter and wrapper approaches
for maximising classification accuracy,โ Pertanika Journal of Science and Technology, vol. 26, no. 1, pp. 329โ340, 2018.
[9] E. Gravier et al., โA prognostic DNA signature for T1T2 node-negative breast cancer patients,โ Genes, Chromosomes and Cancer,
vol. 49, no. 12, pp. 1125โ1134, Dec. 2010, doi: 10.1002/gcc.20820.
[10] D. Singh et al., โGene expression correlates of clinical prostate cancer behavior,โ Cancer Cell, vol. 1, no. 2, pp. 203โ209, Mar.
2002, doi: 10.1016/S1535-6108(02)00030-2.
[11] Q. H. Nguyen et al., โInfluence of data splitting on performance of machine learning models in prediction of shear strength of soil,โ
Mathematical Problems in Engineering, vol. 2021, pp. 1โ15, Feb. 2021, doi: 10.1155/2021/4832864.
[12] B. Vrigazova, โThe proportion for splitting data into training and test set for the bootstrap in classification problems,โ Business
Systems Research Journal, vol. 12, no. 1, pp. 228โ242, May 2021, doi: 10.2478/bsrj-2021-0015.
[13] V. Nasteski, โAn overview of the supervised machine learning methods,โ Horizons.B, vol. 4, pp. 51โ62, Dec. 2017, doi:
10.20544/HORIZONS.B.04.1.17.P05.
[14] M. Abd-elnaby, M. Alfonse, and M. Roushdy, โA hybrid mutual information-LASSO-genetic algorithm selection approach for
classifying breast cancer,โ in Digital Transformation Technology, 2022, pp. 547โ560. doi: 10.1007/978-981-16-2275-5_36.
[15] S. L. Verghese, I. Y. Liao, T. H. Maul, and S. Y. Chong, โAn empirical study of several information theoretic based feature
extraction methods for classifying high dimensional low sample size data,โ IEEE Access, vol. 9, pp. 69157โ69172, 2021, doi:
10.1109/ACCESS.2021.3077958.
[16] M. Rostami, S. Forouzandeh, K. Berahmand, M. Soltani, M. Shahsavari, and M. Oussalah, โGene selection for microarray data
classification via multi-objective graph theoretic-based method,โ Artificial Intelligence in Medicine, vol. 123, Jan. 2022, doi:
10.1016/j.artmed.2021.102228.
[17] B. Ghaddar and J. Naoum-Sawaya, โHigh dimensional data classification and feature selection using support vector machines,โ
European Journal of Operational Research, vol. 265, no. 3, pp. 993โ1004, Mar. 2018, doi: 10.1016/j.ejor.2017.08.040.
[18] H. Lu, J. Chen, K. Yan, Q. Jin, Y. Xue, and Z. Gao, โA hybrid feature selection algorithm for gene expression data classification,โ
Neurocomputing, vol. 256, pp. 56โ62, Sep. 2017, doi: 10.1016/j.neucom.2016.07.080.
[19] N. Ibrahim, โVariable selection methods for classification : application to metabolomics data,โ University of Liverpool, no. March,
2020.
[20] J. Cho, K. Lee, E. Shin, G. Choy, and S. Do, โHow much data is needed to train a medical image deep learning system to achieve
necessary high accuracy?,โ Prepr. arXiv.1511.06348, Nov. 2015.
[21] G. Heinze, C. Wallisch, and D. Dunkler, โVariable selection - a review and recommendations for the practicing statistician,โ
Biometrical Journal, vol. 60, no. 3, pp. 431โ449, May 2018, doi: 10.1002/bimj.201700067.
[22] G. Chandrashekar and F. Sahin, โA survey on feature selection methods,โ Computers & Electrical Engineering, vol. 40, no. 1,
pp. 16โ28, Jan. 2014, doi: 10.1016/j.compeleceng.2013.11.024.
[23] M. Al Khaldy, โResampling imbalanced class and the effectiveness of feature selection methods for heart failure dataset,โ
International Robotics & Automation Journal, vol. 4, no. 1, Feb. 2018, doi: 10.15406/iratj.2018.04.00090.
[24] R. Porkodi, โComparison of filter based feature selection algorithms : An overview,โ International Journal of Innovative Research
in Technology & Science(IJIRTS), vol. 2, no. 2, pp. 108โ113, 2014.
[25] C. De Stefano, F. Fontanella, and A. Scotto di Freca, โFeature selection in high dimensional data by a filter-based genetic algorithm,โ
in EvoApplications 2017: Applications of Evolutionary Computation, 2017, pp. 506โ521. doi: 10.1007/978-3-319-55849-3_33.
[26] S. DeepaLakshmi and T. Velmurugan, โEmpirical study of feature selection methods for high dimensional data,โ Indian Journal of
Science and Technology, vol. 9, no. 39, Oct. 2016, doi: 10.17485/ijst/2016/v9i39/90599.
[27] A. Pedersen et al., โMissing data and multiple imputation in clinical epidemiological research,โ Clinical Epidemiology, vol. 9, pp.
157โ166, Mar. 2017, doi: 10.2147/CLEP.S129785.
[28] U. Grรถmping, โPractical guide to logistic regression,โ Journal of Statistical Software, vol. 71, no. 3, 2016, doi:
10.18637/jss.v071.b03.
[29] M. Mafarja and S. Mirjalili, โWhale optimization approaches for wrapper feature selection,โ Applied Soft Computing, vol. 62,
pp. 441โ453, Jan. 2018, doi: 10.1016/j.asoc.2017.11.006.
[30] A. Mirzaei, Y. Mohsenzadeh, and H. Sheikhzadeh, โVariational relevant sample-feature machine: A fully bayesian approach for
embedded feature selection,โ Neurocomputing, vol. 241, pp. 181โ190, Jun. 2017, doi: 10.1016/j.neucom.2017.02.057.
10. Int J Elec & Comp Eng ISSN: 2088-8708 ๏ฒ
Hybrid filtering methods for feature selection in high-dimensional cancer data (Siti Sarah Md Noh)
6871
[31] F. Eckstein et al., โEffect of training set sample size on the agreement, accuracy, and sensitivity to change of automated U-net-
based cartilage thickness analysis,โ Osteoarthritis and Cartilage, vol. 29, pp. S326--S327, Apr. 2021, doi:
10.1016/j.joca.2021.02.427.
[32] M. R. Arbabshirani, S. Plis, J. Sui, and V. D. Calhoun, โSingle subject prediction of brain disorders in neuroimaging: Promises and
pitfalls,โ NeuroImage, vol. 145, pp. 137โ165, Jan. 2017, doi: 10.1016/j.neuroimage.2016.02.079.
BIOGRAPHIES OF AUTHORS
Siti Sarah Md Noh received the Diploma in Statistics in 2018 and Bachelor of
Science (Hons) (Statistics) in 2021 at Universiti Teknologi MARA. She is now currently a
student, undergoing Master of Science Applied Statistics at Universiti Teknologi MARA Shah
Alam. Her research interest includes data mining, applied statistics and medical statistics. She
can be contacted at email: sarah97noh@gmail.com.
Nurain Ibrahim received the Bachelor of Science (Hons) (Statistics), Master of
Science Applied Statistics from Universiti Teknologi MARA in 2013 and 2014, and PhD in
Biostatistics from University of Liverpool, United Kingdom in 2020. She is currently a senior
lecturer with the School of Mathematical Sciences, College of Computing, Informatics and
Media, Universiti Teknologi MARA, Malaysia and she also is currently an associate fellow
researcher at the Institute for Big Data Analytics and Artificial Intelligence (IBDAAI). Her
research interests include biostatistics, data mining with multivariate data analysis, and
applied statistics. She has experienced in teaching A-level students for UK, US and Germany
bounds at INTEC Education College in 2013 to 2014. She can be contacted at email:
nurainibrahim@uitm.edu.my.
Mahayaudin M. Mansor obtained a B.Sc.App. (Hons) (Mathematical
Modelling) from Universiti Sains Malaysia in 2005, a M.Sc. (Quantitative Science) from
Universiti Teknologi MARA in 2011, and a Ph.D. in Statistics from The University of
Adelaide, Australia in 2018. He worked in banking and insurance before teaching Statistics
and Business Mathematics at the Universiti Tun Abdul Razak Malaysia. After completing his
PhD studies, he worked as a researcher specializing in data management and quantitative
research at the Australian Centre for Precision Health, University of South Australia. He is
currently a senior lecturer in Statistics at the School of Mathematical Sciences, Universiti
Teknologi MARA and an external supervisor register at the School of Mathematical Sciences,
The University of Adelaide. His research interest focuses on data analytics, time series,
R computing, quantitative finance and vulnerable populations. He can be contacted at email:
maha@uitm.edu.my.
Marina Yusoff is currently a senior fellow researcher at the Institute for Big Data
Analytics and Artificial Intelligence (IBDAAI) and Associate Professor of Computer and
Mathematical Sciences, Universiti Teknologi MARA Shah Alam, Malaysia. She has a Ph.D.
in Information Technology and Quantitative Sciences (Intelligent Systems). She previously
worked as a Senior Executive of Information Technology in SIRIM Berhad, Malaysia. She is
most interested in multidisciplinary research, artificial intelligence, nature-inspired computing
optimization, and data analytics. She applied and modified AI methods in many research and
projects, including recent hybrid deep learning, particle swarm optimization, genetic
algorithm, ant colony, and cuckoo search for many real-world problems, medical and
industrial projects. Her recent projects are data analytic optimizer, audio, and image pattern
recognition. She has many impact journal publications and contributes as an examiner and
reviewer to many conferences, journals, and universities' academic activities. She can be
contacted at marina998@uitm.edu.my.