Highly Reliable Hepatitis Diagnosis with multi classifiers (ahmedbohy)
The diagnosis of some diseases, such as hepatitis, is a very difficult task for a doctor, who usually reaches a decision by comparing the current test results of a patient with those of other patients with the same condition. Hepatitis is one of the most common diseases among Egyptians, who account for 22% of hepatitis cases around the world. This motivated us to propose new methods that improve the outcomes of existing approaches and help doctors and specialists in diagnosing hepatitis and predicting patient survival.
Classification techniques; Fusion; WEKA.
Performance evaluation of hepatitis diagnosis using single and multi classifi... (ahmedbohy)
The goal of our paper is to obtain superior accuracy from different classifiers, or from multi-classifier fusion, in diagnosing hepatitis using a widely used dataset from Ljubljana University. We present an implementation of several classification methods regarded as among the best algorithms in the medical field, then apply fusion between classifiers to find the best multi-classifier fusion approach. Classification accuracy is obtained from the confusion matrix built with the 10-fold cross-validation technique. The experimental results show that, for all datasets (complete, reduced, and no missing values), multi-classifier fusion achieved better accuracy than the single classifiers.
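The confusion-matrix accuracy and classifier-fusion ideas described above can be sketched in a few lines. The majority-vote rule and the labels below are hypothetical illustrations, not the paper's exact fusion scheme:

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: one list of labels per classifier, aligned by instance
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def accuracy(y_true, y_pred):
    # overall accuracy: equivalent to trace/total of the confusion matrix
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

preds = [["live", "die", "die"],   # classifier 1
         ["live", "live", "die"],  # classifier 2
         ["die",  "die",  "die"]]  # classifier 3
fused = majority_vote(preds)       # -> ["live", "die", "die"]
```

In a 10-fold cross-validation setting, this accuracy would be computed per fold and averaged over the ten folds.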
IRJET - Survey on Analysis of Breast Cancer Prediction (IRJET Journal)
This document compares three machine learning techniques - Support Vector Machine (SVM), Random Forest (RF), and Naive Bayes (NB) - for predicting breast cancer using a dataset of 198 patient records. It finds that SVM achieved the highest accuracy of 96.97% for classification, followed by RF at 96.45% and NB at 95.45%. SVM also had the highest recall rate at 0.97, indicating it was best at correctly identifying malignant tumors. While NB had the lowest precision of 0.92, meaning it incorrectly identified some benign cases as malignant, all three techniques showed high performance in predicting breast cancer.
Measuring Improvement in Access to Complete Data in Healthcare Collaborative ... (Nurul Emran)
1) The document evaluates a Collaborative Integrated Database System (COLLIDS) in terms of improving access to complete healthcare data from multiple providers.
2) Statistical analysis found a significant improvement in data completeness after implementing COLLIDS, with over 50% completeness for most providers.
3) COLLIDS was shown to benefit all participating healthcare providers by improving access to more complete patient records compared to individual provider access alone.
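The data-completeness measure used in such evaluations can be illustrated with a minimal sketch. The field-level definition below (fraction of non-missing values, with `None` marking a missing field) is an assumption, not necessarily the exact metric used by COLLIDS:

```python
def completeness(records):
    # records: list of dicts, one per patient record; None marks a missing field
    total = sum(len(r) for r in records)
    filled = sum(1 for r in records for v in r.values() if v is not None)
    return filled / total if total else 0.0

# two records, four fields total, one field missing -> 0.75 completeness
score = completeness([{"a": 1, "b": None}, {"a": 2, "b": 3}])
```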
Cancer prognosis prediction using balanced stratified sampling (ijscai)
High accuracy in cancer prediction is important for improving the quality of treatment and the survival rate of patients. As data volume grows rapidly in healthcare research, the analytical challenge is twofold. The use of an effective sampling technique in classification algorithms consistently yields good prediction accuracy. The SEER public-use cancer database provides several prominent class labels for prognosis prediction. The main objective of this paper is to study the effect of sampling techniques on classifying the prognosis variable and to propose an ideal sampling method based on the experimental outcomes. In the first phase of this work, the traditional random sampling and stratified sampling techniques were used. Next, balanced stratified sampling with variations according to the choice of prognosis class labels was tested. Much of the initial effort was devoted to pre-processing the SEER dataset. The classification models were built on the breast cancer, respiratory cancer, and mixed cancer datasets with three traditional classifiers, namely Decision Tree, Naïve Bayes, and K-Nearest Neighbour. Three prognosis factors, survival, stage, and metastasis, were used as class labels for experimental comparison. The results show a steady increase in the prediction accuracy of the balanced stratified model as the sample size increases, whereas the traditional approach fluctuates before reaching optimal results.
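Proportional stratified sampling, the baseline the paper compares against, can be sketched as follows. The dict-based record format, `label_key`, and `frac` parameter are illustrative assumptions:

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_key, frac, seed=0):
    # draw the same fraction from every class so class proportions are preserved
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    sample = []
    for label, members in groups.items():
        k = max(1, round(len(members) * frac))
        sample.extend(rng.sample(members, k))
    return sample
```

A "balanced" variant would instead draw an equal count from every class (e.g. the minority class size), trading proportionality for class balance.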
Classification of Health Care Data Using Machine Learning Technique (inventionjournals)
Diagnosing diseases with an intelligent system produces high accuracy and is treated as a classification problem; classification of health-care data using machine learning techniques can thus serve as an intelligent health-care system (IHSM). Heart disease is a major issue, commonly found in human beings due to changing lifestyles, and needs to be identified in advance. This paper explores various machine learning techniques to classify heart data downloaded from the UCI repository. After applying a feature selection technique (FST), a decision-tree-based technique, CART, produces the highest accuracy with four features, followed by the other classification techniques.
Data Mining - Classification Of Breast Cancer Dataset using Decision Tree Ind... (Sunil Nair)
The document summarizes research on classifying breast cancer datasets using decision trees. The researchers used a Wisconsin breast cancer dataset containing 699 instances with 10 attributes plus a class attribute. They preprocessed the data to handle missing values, compared various classification methods, and achieved the best accuracy of 97% using decision trees with attribute selection. Issues addressed included unbalanced classes, and future work proposes methods such as clustering and multiple classifiers to further improve accuracy.
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms... (IRJET Journal)
This document describes a study that uses supervised machine learning algorithms to predict breast cancer. Three algorithms - decision tree, logistic regression, and random forest - are applied to preprocessed breast cancer data. The random forest model achieved the best accuracy at 98.6% for predicting whether a tumor was benign or malignant. The study aims to develop an early prediction system for breast cancer using machine learning techniques.
Early hospital mortality prediction using vital signals (Reza Sadeghi)
1) The document proposes methods for early prediction of mortality in ICU patients using only the initial one hour of heart rate signals without dependency on other clinical records which can contain missing values.
2) A classification model is developed using 12 statistical features extracted from heart rate signals and evaluated on a MIMIC-III dataset, with classifiers such as random forest and SVM achieving precision and recall over 0.95.
3) Interpretable classifiers like decision trees performed reasonably well but with lower performance, highlighting the transparency-accuracy tradeoff in early mortality prediction.
We conducted a comparative analysis of different supervised dimension reduction techniques by integrating a set of data splitting algorithms, and demonstrate that the relative efficacy of the learning algorithms depends on sample complexity. The issue of sample complexity is discussed in relation to the data splitting algorithms. In line with expectations, every supervised learning classifier showed a different capability under different data splitting algorithms, and no direct way to compute an overall ranking of techniques was available. We therefore focused on the dependence of classifier ranking on the data splitting algorithm and devised a model built on weighted average rank, the Weighted Mean Rank Risk Adjusted Model (WMRRAM), for consensus ranking of learning classifier algorithms.
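The weighted-average-rank idea behind such consensus ranking can be sketched as follows. The exact weighting used by WMRRAM is not given here, so the weights below are placeholder assumptions:

```python
def weighted_mean_rank(ranks, weights):
    # ranks:   {classifier: [rank under each data-splitting algorithm]}
    # weights: one weight per splitting algorithm (e.g. reflecting its risk)
    return {clf: sum(r * w for r, w in zip(rs, weights)) / sum(weights)
            for clf, rs in ranks.items()}

# classifier with the lowest weighted mean rank wins
scores = weighted_mean_rank({"svm": [1, 2], "nb": [2, 1]}, weights=[2, 1])
```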
The document discusses the class imbalance problem in machine learning where the number of samples in one class (the positive or minority class) is much less than the samples in another class (the negative or majority class). This can cause classifiers to be biased towards the majority class. Two approaches to address this problem are discussed: sampling-based approaches and cost-function based approaches. Sampling approaches like oversampling, undersampling, and SMOTE are explained in detail. Oversampling adds more samples from the minority class, while undersampling removes samples from the majority class. SMOTE generates new synthetic samples for the minority class. The document advocates for these sampling techniques to help machine learning algorithms better identify the minority class samples.
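The SMOTE interpolation step described above can be sketched minimally, assuming numeric feature vectors. This is a simplification of the published SMOTE algorithm (fixed seed, brute-force neighbour search), not a production implementation:

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    # generate n_new synthetic minority samples by interpolating
    # between a random minority point and one of its k nearest neighbours
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        nbrs = sorted((p for p in minority if p is not a),
                      key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))[:k]
        b = rng.choice(nbrs)
        t = rng.random()  # position along the segment a -> b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

Because each synthetic point lies on a segment between two real minority points, it always stays inside the minority class's convex hull.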
Handling Imbalanced Data: SMOTE vs. Random Undersampling (IRJET Journal)
This document compares different techniques for handling imbalanced data, including random undersampling and SMOTE (Synthetic Minority Over-sampling Technique), using two machine learning classifiers - Random Forest and XGBoost. It finds that XGBoost generally performs better than Random Forest across different sampling techniques in terms of metrics like AUC, sensitivity and specificity. Random undersampling with XGBoost provided the most balanced results on the randomly generated imbalanced dataset used in this study. SMOTE without proper validation gave the best numerical results but likely overfit the data.
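Random undersampling, the other technique compared above, is far simpler: shrink the majority class to the minority class size. This sketch assumes plain Python lists of samples:

```python
import random

def random_undersample(majority, minority, seed=0):
    # keep every minority sample; randomly discard majority samples
    # until both classes are the same size
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

balanced = random_undersample(list(range(100)), list(range(10)))  # 20 samples
```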
Multiclass classification of imbalanced data (SaurabhWani6)
Pydata Talk on Classification of imbalanced data.
It is an overview of concepts for better classification in imbalanced datasets.
Resampling techniques are introduced along with bagging and boosting methods.
This document provides an overview of descriptive statistics, inferential statistics, and regression analysis using PASW Statistics software. It discusses topics such as frequency analysis, measures of central tendency, hypothesis testing, t-tests, ANOVA, chi-square tests, correlation, and linear regression. The document is divided into multiple parts that cover opening and manipulating data files, descriptive statistics, tests of significance, regression analysis, and chi-square/ANOVA. It also discusses importing/exporting data and using scripts in PASW Statistics.
Chronic kidney disease prediction is one of the most important issues in healthcare analytics, and prediction in the medical field is among the most interesting and challenging tasks. In this paper, we employ machine learning techniques for predicting chronic kidney disease from clinical data, including the Decision Tree (DT) and Naive Bayes (NB) algorithms. The performance of these models is compared in order to select the best classifier for predicting chronic kidney disease on the given dataset.
This presentation is on using repeated measures design in the area of social sciences, behavioural sciences, management, sports, physical education etc.
This document discusses evaluating measurement capabilities using gauge repeatability and reproducibility (gauge R&R) analysis of variance (ANOVA) methods. It begins by defining types of measurement errors like accuracy, precision, repeatability and reproducibility. It then describes how ANOVA separates variations due to operators and tools. The document provides an example gauge R&R study using ANOVA calculations in Minitab software to evaluate a measurement system. It finds that less than 1% of the total variation is due to the measurement system, indicating it is acceptable for use. In conclusion, gauge R&R ANOVA analysis can determine if a measurement system produces precise and accurate data to improve process results.
Statistical Prediction for analyzing Epidemiological Characteristics of COVID... (Nuwan Sriyantha Bandara)
In this presentation, we propose a novel integrated model for analyzing the characteristics of the epidemiological curve of COVID-19, utilizing an enhanced compartmental statistical prediction model that draws on the susceptible-infectious-susceptible (SIS) model, the susceptible-infectious-removed (SIR) model, the Dirichlet process model, and the interpretive structural model.
The research was presented orally at the International Research Conference 2020 of Sri Lanka Technological Campus on 17 June 2020.
Abstract Link: http://repo.sltc.ac.lk/handle/1/82
Exploratory Data Analysis (EDA) was promoted by John Tukey in 1977 to encourage visually examining data without hypotheses. EDA uses graphical and non-graphical techniques like histograms, scatter plots, box plots to summarize variable characteristics. EDA allows understanding data distributions and relationships without models through inspection and information graphics. Common EDA goals are describing typical values, variability, distributions, and relationships between variables.
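The box plots mentioned above rest on the five-number summary (min, Q1, median, Q3, max), which can be computed directly. The linear-interpolation quantile rule used here is one of several common conventions:

```python
def five_number_summary(xs):
    # min, Q1, median, Q3, max of a non-empty numeric sequence
    s = sorted(xs)
    def q(p):
        # quantile via linear interpolation between adjacent order statistics
        i = p * (len(s) - 1)
        lo = int(i)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (i - lo) * (s[hi] - s[lo])
    return (s[0], q(0.25), q(0.5), q(0.75), s[-1])
```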
Boost model accuracy of imbalanced covid 19 mortality prediction (BindhuBhargaviTalasi)
The document proposes using a GAN-based oversampling technique to boost the accuracy of an imbalanced COVID-19 mortality prediction model. It analyzes patient data to find correlations between features like gender and age with mortality. A GAN is used to generate synthetic minority class data to address data imbalance. The GAN-enhanced random forest model is evaluated using metrics like recall, precision and F1 score, and shows improved performance over the base model at predicting COVID-19 mortality.
Classification of medical datasets using back propagation neural network powe... (IJECEIAES)
Classification is one of the most indispensable domains in data mining and machine learning. The classification process has a good reputation in the area of disease diagnosis by computer systems, where progress in smart computer technologies can be invested in diagnosing various diseases based on data from real patients documented in databases. The paper introduces a methodology for diagnosing a set of diseases, including two types of cancer (breast and lung), along with datasets for diabetes and heart attack. A back-propagation neural network plays the role of the classifier, and its performance is enhanced by a genetic algorithm that provides the classifier with the optimal features to raise the classification rate as high as possible. The system showed high efficiency in dealing with databases that differ from each other in size, number of features, and nature of the data, as the results illustrate: the classification rate reached 100% on most datasets.
COMPARISION OF PERCENTAGE ERROR BY USING IMPUTATION METHOD ON MID TERM EXAMIN... (ijiert bestjournal)
The issue of incomplete data exists across the entire field of data mining. In this paper, mean imputation, median imputation, and standard deviation imputation are used to deal with the challenges of incomplete data in classification problems. Using different imputation methods converts an incomplete dataset into a complete one. On the complete dataset, the suitable imputation method is applied and the percentage error of each imputation method is compared.
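Mean and median imputation as described can be sketched as follows, with `None` marking a missing value (an assumption about how missingness is encoded):

```python
from statistics import mean, median

def impute(column, method="mean"):
    # replace missing (None) entries with the mean or median of observed values
    observed = [v for v in column if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    return [fill if v is None else v for v in column]

filled = impute([1, None, 3])            # -> [1, 2, 3]
robust = impute([1, None, 3, 100], "median")  # median resists the outlier 100
```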
This document outlines key concepts for analyzing qualitative and quantitative data. It discusses preparing data through editing, coding and inserting into a matrix. Graphical techniques like histograms, scatter plots and box plots are presented for depicting individual, comparative and relational data. Measures of central tendency, dispersion, relationships and models are explained including mean, median, standard deviation, correlation, and linear and non-linear models. The goal is for students to understand how to analyze data using appropriate statistical techniques and data visualization.
Novel Frequency Domain Classification Algorithm Based On Parameter Weight Fac... (ahmedbohy)
This work proposes two new classification techniques for predicting hepatitis mortality using a dataset from Ljubljana University. The first technique estimates missing values by finding the minimum difference between attribute values of the instance with missing values and other instances. The second technique computes a weight factor for each attribute by correlating the decision attribute with other attributes, and classifies new instances using correlation in the frequency domain on the top seven attributes. Experimental results on 155 instances show the frequency domain technique achieved a mean accuracy of 90.4%, higher than the first technique and previous methods.
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ... (ijcsa)
Automatic diagnosis of coronary heart disease supports the doctor in making a diagnosis decision. Coronary heart disease has several types or levels; in the UCI Repository dataset it is divided into four levels labeled 1-4 (low, medium, high, and serious). The diagnosis models can therefore be analyzed with a multiclass classification approach, one instance of which is the support vector machine (SVM), chosen for its strong performance in binary classification. This research studies the performance of multiclass SVM classification in diagnosing the type or level of coronary heart disease, with patient data taken from the UCI Repository. The stages of this study are preprocessing, which consists of normalizing the data and dividing it into training and testing sets, followed by multiclass classification and performance analysis. The study uses several multiclass SVM algorithms, namely: Binary Tree Support Vector Machine (BT-SVM), One-Against-One (OAO), One-Against-All (OAA), Decision Directed Acyclic Graph (DDAG), and Exhaustive Output Error Correction Code (ECOC). The performance parameters used are recall, precision, F-measure, and overall accuracy. The experimental results showed that the BT-SVM, OAA-SVM, and ECOC-SVM algorithms gave the highest recall in diagnosing the healthy level, with values above 90%, precision of 82.143%, and F-measure of 86.793%. All algorithms except OAA-SVM gave a recall of 0.0% for the sick-high and sick-serious levels, and the ECOC-SVM algorithm gave a recall of 0.0% for the sick-medium and sick-serious levels; for the other levels, recall, precision, and F-measure ranged between 20% and 30%. The conclusion is that the multiclass classification approach with the BT-SVM, OAO-SVM, and DDAG-SVM algorithms provides better performance in diagnosing the type or level of coronary heart disease than the binary classification approach.
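In the One-Against-All (OAA) scheme mentioned above, each class gets its own binary "class vs. rest" SVM, and prediction reduces to picking the class whose binary model returns the highest decision score. A sketch of that final step (the class labels and scores below are made up for illustration):

```python
def one_vs_all_predict(decision_scores):
    # decision_scores: {class_label: decision value of that class's
    # "class vs rest" binary classifier}; highest score wins
    return max(decision_scores, key=decision_scores.get)

# hypothetical decision values for the four coronary heart disease levels
label = one_vs_all_predict({"low": -0.3, "medium": 0.8,
                            "high": 0.1, "serious": -1.2})  # -> "medium"
```

One-Against-One (OAO) instead trains a classifier per class pair and aggregates by voting, trading more models for easier binary subproblems.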
This document outlines key concepts for analyzing qualitative and quantitative data. It discusses preparing data through editing, coding and inserting into a matrix. Graphical techniques like histograms, scatter plots and box plots are presented for depicting individual, comparative and relational data. Measures of central tendency, dispersion, relationships and models are explained including mean, median, standard deviation, correlation, and linear and non-linear models. The goal is for students to understand how to analyze data using appropriate statistical techniques and data visualization.
Novel Frequency Domain Classification Algorithm Based On Parameter Weight Fac...ahmedbohy
This work proposes two new classification techniques for predicting hepatitis mortality using a dataset from Ljubljana University. The first technique estimates missing values by finding the minimum difference between attribute values of the instance with missing values and other instances. The second technique computes a weight factor for each attribute by correlating the decision attribute with other attributes, and classifies new instances using correlation in the frequency domain on the top seven attributes. Experimental results on 155 instances show the frequency domain technique achieved a mean accuracy of 90.4%, higher than the first technique and previous methods.
PERFORMANCE ANALYSIS OF MULTICLASS SUPPORT VECTOR MACHINE CLASSIFICATION FOR ...ijcsa
Automatic diagnosis of coronary heart disease helps the doctor to support in decision making a diagnosis.
Coronary heart disease have some types or levels. Referring to the UCI Repository dataset, it divided into 4
types or levels that are labeled numbers 1-4 (low, medium, high and serious). The diagnosis models can be
analyzed with multi class classification approach. One of multi class classification approach used, one of
which is a support vector machine (SVM). The SVM use due to strong performance of SVM in binary classification. This research study multiclass performance classification support vector machine to diagnose the type or level of coronary heart disease. Coronary heart disease patient data taken from the
UCI Repository. Stages in this study is preprocessing, which consist of, to normalizing the data, divide the data into data training and testing. The next stage of multiclass classification and performance analysis.This study uses multiclass SVM algorithm, namely: Binary Tree Support Vector Machine (BTSVM), OneAgainst-One (OAO), One-Against-All (OAA), Decision Direct Acyclic Graph (DDAG) and Exhaustive
Output Error Correction Code ( ECOC). Performance parameter used is recall, precision, F-measure and
Overall accuracy. The experiment results showed that the multiclass SVM classification algorithm with the
algorithm BT-SVM, OAA-SVM and the ECOC-SVM,gave the highest Recall in the diagnosis of type or
healthy level with a value above 90%, precision 82.143% and 86.793% F-measure,. For all kinds of
algorithms, except binary OAA-SVM algorithm gave the highest recall 0.0% for the type or level of sickhigh
and sick-serious, and ECOC- SVM algorithm gave the highest recall 0.0% for sick-medium and sickserious.
While the type or level other, the performance of recall, precision and F-measure between 20% -
30%,. The conclusion that can be drawn is that the approach to the multiclass classification algorithm BTSVM,
OAO-SVM, DDAG-SVM to diagnosis the type or level of coronary heart disease provides better performance, than the binary classification approach.
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...Damian R. Mingle, MBA
Each year it has become more and more difficult for healthcare providers to determine if a patient has a pathology
related to the vertebral column. There is great potential to become more efficient and effective in terms of quality
of care provided to patients through the use of automated systems. However, in many cases automated systems
can allow for misclassification and force providers to have to review more causes than necessary. In this study, we
analyzed methods to increase the True Positives and lower the False Positives while comparing them against stateof-the-art
techniques in the biomedical community. We found that by applying the studied techniques of a data-driven
model, the benefits to healthcare providers are significant and align with the methodologies and techniques utilized
in the current research community.
Ensemble Classifier Approach in Breast Cancer Detection and Malignancy Gradin...ijmpict
The diagnosed cases of Breast cancer is increasing annually and unfortunately getting converted into a
high mortality rate. Cancer, at the early stages, is hard to detect because the malicious cells show similar
properties (density) as shown by the non-malicious cells. The mortality ratio could have been minimized if
the breast cancer could have been detected in its early stages. But the current systems have not been able
to achieve a fully automatic system which is not just capable of detecting the breast cancer but also can
detect the stage of it. Estimation of malignancy grading is important in diagnosing the degree of growth of
malicious cells as well as in selecting a proper therapy for the patient. Therefore, a complete and efficient
clinical decision support system is proposed which is capable of achieving breast cancer malignancy
grading scheme very efficiently. The system is based on Image processing and machine learning domains.
Classification Imbalance problem, a machine learning problem, occurs when instances of one class is
much higher than the instances of the other class resulting in an inefficient classification of samples and
hence a bad decision support system. Therefore EUSBoost, ensemble based classifier is proposed which is
efficient and is able to outperform other classifiers as it takes the benefits of both-boosting algorithm with
Random Undersampling techniques. Also comparison of EUSBoost with other techniques is shown in the
paper.
IRJET- Diabetes Prediction by Machine Learning over Big Data from Healthc...IRJET Journal
This document discusses using machine learning techniques to predict diabetes based on healthcare data. It proposes using preprocessing, K-means clustering, and support vector machine (SVM) classification. Preprocessing cleans and structures the data. K-means clusters the data into groups. SVM classification then predicts whether patients are diabetic or non-diabetic, aiming for a prediction accuracy of 94.9%. The techniques aim to allow for early diabetes prediction using a combination of machine learning methods on both structured and unstructured healthcare data.
ICU Patient Deterioration Prediction : A Data-Mining Approachcsandit
A huge amount of medical data is generated every da
y, which presents a challenge in analysing
these data. The obvious solution to this challenge
is to reduce the amount of data without
information loss. Dimension reduction is considered
the most popular approach for reducing
data size and also to reduce noise and redundancies
in data. In this paper, we investigate the
effect of feature selection in improving the predic
tion of patient deterioration in ICUs. We
consider lab tests as features. Thus, choosing a su
bset of features would mean choosing the
most important lab tests to perform. If the number
of tests can be reduced by identifying the
most important tests, then we could also identify t
he redundant tests. By omitting the redundant
tests, observation time could be reduced and early
treatment could be provided to avoid the risk.
Additionally, unnecessary monetary cost would be av
oided. Our approach uses state-of-the-art
feature selection for predicting ICU patient deteri
oration using the medical lab results. We
apply our technique on the publicly available MIMIC
-II database and show the effectiveness of
the feature selection. We also provide a detailed a
nalysis of the best features identified by our
approach.
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHcscpconf
A huge amount of medical data is generated every day, which presents a challenge in analysing
these data. The obvious solution to this challenge is to reduce the amount of data without
information loss. Dimension reduction is considered the most popular approach for reducing
data size and also to reduce noise and redundancies in data. In this paper, we investigate the
effect of feature selection in improving the prediction of patient deterioration in ICUs. We
consider lab tests as features. Thus, choosing a subset of features would mean choosing the
most important lab tests to perform. If the number of tests can be reduced by identifying the
most important tests, then we could also identify the redundant tests. By omitting the redundant
tests, observation time could be reduced and early treatment could be provided to avoid the risk.
Additionally, unnecessary monetary cost would be avoided. Our approach uses state-of-the-art
feature selection for predicting ICU patient deterioration using the medical lab results. We
apply our technique on the publicly available MIMIC-II database and show the effectiveness of
the feature selection. We also provide a detailed analysis of the best features identified by our
approach.
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...ahmad abdelhafeez
The goal of this paper is to compare between different classifiers or multi-classifiers fusion with respect to accuracy in discovering breast cancer for four different data sets. We present an implementation among various classification techniques which represent the most known algorithms in this field on four different datasets of breast cancer two for diagnosis and two for prognosis. We present a fusion between classifiers to get the best multi-classifier fusion approach to each data set individually. By using confusion matrix to get classification accuracy which built in 10-fold cross validation technique. Also, using fusion majority voting (the mode of the classifier output). The experimental results show that no classification technique is better than the other if used for all datasets, since the classification task is affected by the type of dataset. By using multi-classifiers fusion the results show that accuracy improved in three datasets out of four.
IRJET- Breast Cancer Prediction using Supervised Machine Learning AlgorithmsIRJET Journal
This document describes a study that used machine learning algorithms to predict breast cancer. The researchers trained decision tree, logistic regression, and random forest models on a breast cancer dataset. They preprocessed the data, which included converting categorical variables to numeric. The random forest model achieved the best prediction accuracy of 98.6%. The study aims to help detect breast cancer at earlier stages to improve treatment outcomes.
Robust Breast Cancer Diagnosis on Four Different Datasets Using Multi-Classif...ahmad abdelhafeez
Abstract- The goal of this paper is to compare between different classifiers or multi-classifiers fusion with respect to accuracy in discovering breast cancer for four different data sets. We present an implementation among various classification techniques which represent the most known algorithms in this field on four different datasets of breast cancer two for diagnosis and two for prognosis. We present a fusion between classifiers to get the best multi-classifier fusion approach to each data set individually. By using confusion matrix to get classification accuracy which built in 10-fold cross validation technique. Also, using fusion majority voting (the mode of the classifier output). The experimental results show that no classification technique is better than the other if used for all datasets, since the classification task is affected by the type of dataset. By using multi-classifiers fusion the results show that accuracy improved in three datasets out of four.
Classification AlgorithmBased Analysis of Breast Cancer DataIIRindia
The classification algorithms are very frequently used algorithms for analyzing various kinds of data available in different repositories which have real world applications. The main objective of this research work is to find the performance of classification algorithms in analyzing Breast Cancer data via analyzing the mammogram images based its characteristics.Different attribute values of cancer affected mammogram images are considered for analysis in this work. The Patients food habits, age of the patients, their life styles, occupation, their problem about the diseases and other information are taken into account for classification. Finally, performance of classification algorithms J48, CART and ADTree are given with its accuracy. The accuracy of taken algorithms is measured by various measures like specificity, sensitivity and kappa statistics (Errors).
Theory and Practice of Integrating Machine Learning and Conventional Statisti...University of Malaya
The practice of medical decision making is changing rapidly with the development of innovative
computing technologies. The growing interest of data analysis in line with the advancement in data
science raises the question of whether machine learning can be integrated with conventional statistics
in health research. To help address this knowledge gap, this talk focuses on the conceptual
integration between conventional statistics and machine learning, with a direction towards health
research. The similarities and differences between the two are compared using mathematical
concepts and algorithms. The comparison between conventional statistics and machine learning
methods indicates that conventional statistics are the fundamental basis of machine learning, where
the black box algorithms are derived from basic mathematics, but are advanced in terms of
automated analysis, handling big data and providing interactive visualizations. While the nature of
both these methods are different, they are conceptually similar. The evidence produced here
concludes that conventional statistics and machine learning are best to be integrated to develop
automated data analysis tools. Health researchers may explore machine learning as a potential tool to
enhance conventional statistics in data analytics for added reliable validation measures.
This document provides an overview of statistics used in meta-analysis. It discusses key concepts like odds ratios, relative risk, confidence intervals, heterogeneity, and fixed and random effects models. It also summarizes different types of meta-analyses including realist reviews, meta-narrative reviews, and network meta-analyses. Software for performing meta-analyses and potential pitfalls in systematic reviews are also briefly covered.
An Empirical Study On Diabetes Mellitus Prediction For Typical And Non-Typica...Scott Faria
The document presents a study that uses machine learning approaches to predict diabetes for both typical and non-typical cases. Three machine learning algorithms (Bagging, Logistic Regression, Random Forest) were applied to a dataset of 340 patients with 26 features, and their accuracy was measured. Random Forest performed best with an accuracy of 90.29%, followed by Bagging at 89.12% and Logistic Regression at 83.24%.
Machine Learning Based Approaches for Cancer Classification Using Gene Expres...mlaij
The classification of different types of tumor is of great importance in cancer diagnosis and drug discovery.
Earlier studies on cancer classification have limited diagnostic ability. The recent development of DNA
microarray technology has made monitoring of thousands of gene expression simultaneously. By using this
abundance of gene expression data researchers are exploring the possibilities of cancer classification.
There are number of methods proposed with good results, but lot of issues still need to be addressed. This
paper present an overview of various cancer classification methods and evaluate these proposed methods
based on their classification accuracy, computational time and ability to reveal gene information. We have
also evaluated and introduced various proposed gene selection method. In this paper, several issues
related to cancer classification have also been discussed.
This document provides an overview of data mining techniques designed for imbalanced datasets. It discusses how imbalanced datasets, where one class is greatly underrepresented compared to others, pose challenges for machine learning algorithms. Several approaches have been proposed to address this issue, including sampling methods like oversampling the minority class and undersampling the majority class, as well as cost-sensitive methods that assign higher misclassification costs. The document reviews common sampling strategies used for imbalanced datasets, such as SMOTE, and cost-sensitive approaches involving cost matrices and cost curves. Overall, it examines how various sampling and cost-based methods can help improve classification of imbalanced datasets in fields such as medical diagnosis and text classification.
An overview on data mining designed for imbalanced datasetseSAT Journals
Abstract The imbalanced datasets with the classifying categories are not around equally characterized. A problem in imbalanced dataset occurs in categorization, where the amount of illustration of single class will be greatly lesser than the illustrations of the previous classes. Current existence brought improved awareness during implementation of machine learning methods to complex real world exertion, which is considered by several through imbalanced data. In machine learning the imbalanced datasets has become a critical problem and also usually found in many implementation such as detection of fraudulent calls, bio-medical, engineering, remote-sensing, computer society and manufacturing industries. In order to overcome the problems several approaches have been proposed. In this paper a study on Imbalanced dataset problem and examine various sampling method utilized in favour of evaluation of the datasets, moreover the interpretation methods are further suitable for imbalanced datasets mining. Keywords: Imbalance Problems, Imbalanced datasets, sampling strategies, Machine Learning.
Multi-Cluster Based Approach for skewed Data in Data MiningIOSR Journals
This document proposes a multi-cluster based majority under-sampling approach for handling skewed data in data mining. It begins with an introduction to the class imbalance problem, where one class has significantly more samples than the other class. Common techniques for addressing this include under-sampling the majority class and over-sampling the minority class. However, under-sampling can lose important majority class information. The proposed approach first clusters the majority class into multiple clusters. It then under-samples each cluster based on its size to select representative samples, avoiding information loss. It also oversamples the minority class. Experimental results on several datasets show the approach improves prediction of the minority class compared to standard under-sampling.
Diagnosis of Cancer using Fuzzy Rough Set TheoryIRJET Journal
This document presents a study that uses fuzzy rough set theory to diagnose cancer using medical data. It has four main modules: 1) feature selection to identify relevant features using fuzzy rough subset evaluation and particle swarm optimization, 2) instance selection to remove missing/noisy data, 3) classification of the data using fuzzy rough nearest neighbor algorithm, and 4) performance analysis using metrics like accuracy, sensitivity and AUC. The study aims to classify different cancer types by reducing noise and selecting optimal features/instances to improve classifier performance. It is found that fuzzy rough set approaches help preserve meaning during reduction and improve classification compared to other methods.
Hepatitis Diagnosis Based on Artificial Intelligence Using Single and Multi-Classifiers
Ahmed M. Salah EL-Bohy1, Atallah I. Hashad2, Hussien Saad Taha3, and Ahmed Azouz4
Arab Academy for Science, Technology & Maritime Transport, Cairo, Egypt
1 a.bohy4master@gmail.com, 2 hashad@aast.edu, 3 htaha10@yahoo.com, 4 azouzahmed79@gmail.com
Abstract: The goal of our paper is to obtain superior accuracy from different classifiers, or from multi-classifier fusion, in diagnosing hepatitis using a worldwide data set from Ljubljana University. We present an implementation of several classification methods that are regarded as the best-performing algorithms in the medical field. We then apply a fusion between classifiers to find the best multi-classifier fusion approach. Classification accuracy is obtained from the confusion matrix built with the 10-fold cross-validation technique. The experimental results show that multi-classifier fusion achieves better accuracy than any single classifier.
Keywords: Classification techniques; Fusion; WEKA.
I. INTRODUCTION
The diagnosis of some diseases, such as hepatitis, is a very difficult task for a doctor: doctors usually reach a decision by comparing the current test results of a patient with those of other patients with the same condition. Hepatitis is one of the most common diseases among Egyptians, as Egypt represents 22% of hepatitis cases around the world. This motivates us to suggest new methods that improve the outcomes of existing approaches, as well as to help doctors and specialists diagnose hepatitis disease survival [1].
Hepatitis is an inflammation of the liver. The condition can be self-limiting or can progress to fibrosis (scarring), cirrhosis, or liver cancer. Hepatitis viruses are the most common cause of hepatitis in the world, but other infections, toxic substances (e.g., alcohol, certain drugs), and autoimmune diseases can also cause hepatitis. There are 5 main hepatitis viruses, referred to as types A, B, C, D, and E. These 5 types are of greatest concern because of the burden of illness and death they cause and their potential for outbreaks and epidemic spread. In particular, types B and C lead to chronic disease in hundreds of millions of people and, together, are the most common cause of liver cirrhosis and cancer. Hepatitis B virus (HBV) is transmitted through exposure to infective blood, semen, and other body fluids. HBV can be transmitted from infected mothers to infants at the time of birth, or from a family member to an infant in early childhood. Transmission may also occur through transfusions of HBV-contaminated blood and blood products, contaminated injections during medical procedures, and injection drug use. HBV also poses a risk to healthcare workers who sustain accidental needle-stick injuries while caring for HBV-infected patients. Safe and effective vaccines are available to prevent HBV. Hepatitis C virus (HCV) is mostly transmitted through exposure to infective blood. This may happen through transfusions of HCV-contaminated blood and blood products, contaminated injections during medical procedures, and injection drug use. Sexual transmission is also possible, but is much less common. There is no vaccine for HCV. Inflammation may lead to the death of liver cells (hepatocytes), which severely compromises normal liver function. Acute HBV infection (less than 6 months) may resemble flu: fever, muscle aches, joint pains, and generally feeling unwell. Symptoms specific to hepatitis are dark urine, loss of appetite, nausea, vomiting, jaundice, and pain over the liver [2]. The success of treatment depends on early recognition of the virus, which allows more exact and less aggressive treatment options and lowers mortality from hepatitis.
Recently, data mining has become one of the most valuable tools for operating on data in order to produce useful information for decision-making [3]. Supervised learning, including classification, is one of the most significant branches of data mining, with a known output variable in the dataset. Classification methods can achieve high accuracy in classifying mortality cases. The implementations were carried out with the "WEKA" tool, which stands for the Waikato Environment for Knowledge Analysis.
Many papers apply machine-learning procedures for survivability analysis in the field of hepatitis diagnosis. Here are some examples. Using Support Vector Machines with the wrapper method for predicting hepatitis was introduced, achieving a maximum accuracy of 74.55% [4]; we note, however, that applying the SVM classifier alone yields higher accuracy than the mentioned accuracies, even with feature selection. The accuracy of an improved SVM algorithm using feature selection [5], namely SVM with chi-square, reached 83.12%, but we note that applying other classifiers (Logistic, Simple Logistic, SMO, RF, J48) gives higher accuracy than the mentioned one. Detection of hepatitis based on SVM and data analysis [6] using SVM and a wrapper achieved an accuracy of 85%, but we found that the results are calculated with 7 attributes out of a stated 25, whereas the original number of attributes is only 20.
The rest of this paper is organized as follows. In Section II, classification algorithms are discussed. In Section III, evaluation principles are discussed. In Section IV, the proposed model is shown. Section V reports the experimental results. Finally, Section VI introduces the conclusion.
II. Classification Algorithms
Support Vector Machine (SVM): SVMs are among the best (and many believe the best) "off-the-shelf" supervised learning algorithms. They were derived from statistics in 1992. SVM is widely used in multiple applications: pattern recognition, classification, and regression. SVMs work on an underlying principle, which is to insert a hyper-plane between the classes and orient it so as to keep it at the maximum distance from the nearest data points; these data points, which lie closest to the hyper-plane, are known as support vectors [7-9].
Sequential Minimal Optimization (SMO): SMO is simple, easy to implement, generally faster, and has better scaling properties for difficult SVM problems than the standard SVM training algorithm. SMO quickly solves the SVM QP problem without extra matrix storage (the memory used is linear in the training data set size) and without using numerical QP optimization steps at all. SMO can also be used for online learning. While SMO has been shown to be effective on sparse data sets and especially fast for linear SVMs, the algorithm can be extremely slow on non-sparse data sets and on problems that have many support vectors. Regression problems are especially prone to these issues because the inputs are usually non-sparse real numbers (as opposed to binary inputs) with solutions that have many support vectors. Because of these constraints, there have been few reports of SMO being successfully used on regression problems [8-10].
Logistic Regression (LR): LR is a famous, well-known classifier; it can be used to extend classification results into a deeper analysis. It is not widely used in data mining due to its slow response, especially compared with SVM on large data sets (not our case). It models the relationship between a dependent variable and one or more independent variables, and allows us to look at the fit of the model as well as at the significance of the relationships [11].
Simple Logistic: We use simple logistic regression when we have one nominal variable with two values (male/female, dead/alive) and one measurement variable. The nominal variable is the dependent variable, and the measurement variable is the independent variable. Simple logistic regression is analogous to linear regression, except that the dependent variable is nominal, not a measurement [12].
Random Forest (RF): Random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node, and the trees are built by bagging (bootstrap aggregation). Random forests have the advantages of handling thousands of input variables without variable deletion, giving estimates of which variables are important in the classification, and having an effective method for estimating missing data that maintains accuracy when a large proportion of the data are missing [13].
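The logistic regression model described above maps a linear combination of the inputs to a class probability via the sigmoid link. A minimal sketch in Python (the coefficients b0 and b1 are hypothetical for illustration, not values fitted to the Hepatitis data):

```python
import math

def logistic_predict(x, b0=-4.0, b1=0.1):
    """P(class = positive | x) via the logistic (sigmoid) link.

    b0 and b1 are illustrative coefficients, not fitted to real data.
    """
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# A larger feature value yields a higher positive-class probability;
# the decision boundary (p = 0.5) sits where b0 + b1*x = 0, i.e. x = 40.
print(logistic_predict(20) < 0.5 < logistic_predict(60))  # prints True
```

In practice the coefficients are estimated from the training data (e.g., by maximum likelihood, as WEKA's Logistic classifier does); the sketch only shows how a fitted model produces a probability for one instance.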
III. Evaluation principles
The evaluation method is based on the confusion matrix. The confusion matrix is a visualization tool commonly used to present the performance of classifiers. It displays the relationships between the real class attributes and the predicted classes. The degree of effectiveness of the classification task is calculated from the number of correct and incorrect classifications for each possible value of the variable being classified in the confusion matrix [14].
Table 1. Confusion matrix

Outcome      Predicted Negative    Predicted Positive
Negative     TN                    FP
Positive     FN                    TP
For instance, in a 2-class classification problem with two predefined classes (e.g., positive and negative), the classified test cases are divided into four categories:
• True positives (TP): correctly classified positive instances.
• True negatives (TN): correctly classified negative instances.
• False positives (FP): incorrectly classified negative instances.
• False negatives (FN): incorrectly classified positive instances.
To evaluate classifier performance we use the accuracy measure, which is defined as the total number of correctly classified instances divided by the total number of available instances at a given operating point of a classifier.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
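Equation (1) can be computed directly from the four confusion-matrix counts. A minimal sketch (the counts below are illustrative, not results from the paper):

```python
def accuracy(tp, tn, fp, fn):
    """Overall accuracy per Eq. (1): correctly classified / all instances."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts summing to 155 instances (the Hepatitis data set size):
# 120 correct out of 155.
print(round(accuracy(tp=25, tn=95, fp=20, fn=15) * 100, 2))  # prints 77.42
```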
IV. Proposed methodology
We propose a reliable method for diagnosing Hepatitis and better classifying mortality cases that may be caused by Hepatitis infection, based on data mining using WEKA, as follows:
A. Data preprocessing
Pre-processing steps are applied to the data before classification. The first step is data cleansing: eliminating or reducing noise and treating missing values; we do not apply this step to the Hepatitis data set. The second step is relevance analysis: statistical correlation analysis is used to discard redundant features from further analysis. The data set has one irrelevant attribute, named 'sex', which has no effect on the classification process: we performed the same classification tasks with the attribute 'sex' altered from male to female and vice versa, with essentially the same results. The final step is data transformation: the dataset is transformed by normalization, which is one of the most popular tools used by designers of automatic recognition systems to obtain superior results. Data normalization speeds up training by starting the training procedure with every feature on the same scale. The aim of normalization is to transform the attribute values to a small range.
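The normalization step can be sketched as min-max scaling of each numeric attribute to [0, 1] (a minimal illustration; WEKA's Normalize filter applies the same rescaling per attribute):

```python
def min_max_normalize(values):
    """Rescale one numeric attribute to the [0, 1] range (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map every value to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative 'age' values (hypothetical, not taken from the data set):
print(min_max_normalize([20, 49, 78]))  # prints [0.0, 0.5, 1.0]
```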
B. Single classification task
Data classification is the process of organizing data into categories for its most effective and efficient use. A well-planned data classification system makes essential data easy to find and retrieve. Classification consists of assigning a class label to a set of unclassified cases. In supervised classification, the set of possible classes is known in advance. In unsupervised classification, the set of possible classes is not known; after classification, we can try to assign a name to each class. Unsupervised classification is called clustering. The presumed model depends on the analysis of the training dataset. The derived model can be characterized in several forms, such as simple classification rules, decision trees, and others. Basically, data classification is a two-stage process. In the initial stage, a classifier is built representing a predefined set of concepts or data classes. This is the training stage, where a classification technique builds the classifier by learning from a training dataset and its associated class label attribute. In the next stage, this model is used for measurement: in order to estimate the predictive accuracy of the classifier, a set of instances independent of the training instances is used. We evaluate the classification techniques most frequently mentioned in recently published research in the medical field, to determine the highest-accuracy classifier for each data set.
C. Multi-classifiers fusion classification task
Fusion is a combination of 2 or more classifiers used to enhance classifier performance (accuracy). Our procedure elects the 2 classifiers with the highest accuracy and computes the fused accuracy, then adds the 3rd, then the 4th, and so on until the accuracy decreases; we then stop and take the maximum accuracy obtained, as shown in Figure 1. We name the number of classifiers used in the fusion process the "fusion level". The single classifier with the best accuracy is fused with other single classifiers in the expectation of improving accuracy, and the same process is repeated up to the last fusion level, according to the number of single classifiers, picking the highest accuracy across all processes.
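The greedy fusion procedure can be sketched as follows. This is a simplified illustration using unweighted majority voting over hypothetical 0/1 predictions (WEKA's Vote meta-classifier supports several combination rules; ties here default to the negative class):

```python
def majority_vote(predictions):
    """Fuse per-classifier predictions (lists of 0/1 labels) by majority vote.

    Ties (possible with an even number of classifiers) default to class 0.
    """
    n = len(predictions[0])
    return [1 if sum(p[i] for p in predictions) * 2 > len(predictions) else 0
            for i in range(n)]

def accuracy_of(pred, truth):
    """Fraction of instances classified correctly."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def greedy_fusion(ranked_preds, truth):
    """Add classifiers in accuracy order; stop once fused accuracy drops.

    Returns (fusion level reached, best accuracy obtained).
    """
    best_preds = [ranked_preds[0]]
    best_acc = accuracy_of(ranked_preds[0], truth)
    for p in ranked_preds[1:]:
        acc = accuracy_of(majority_vote(best_preds + [p]), truth)
        if acc < best_acc:
            break
        best_preds.append(p)
        best_acc = acc
    return len(best_preds), best_acc

# Hypothetical predictions from three classifiers, ranked by single accuracy:
truth = [1, 1, 0, 0, 1]
ranked = [[1, 1, 0, 0, 0], [1, 1, 0, 1, 1], [0, 1, 0, 0, 1]]
print(greedy_fusion(ranked, truth))  # prints (3, 1.0)
```

Each single classifier above is 80% accurate, yet the three-way vote corrects all their individual mistakes, which is the effect the fusion experiments exploit.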
We propose 2 methods: the 1st applies a classifier directly to the raw data; the 2nd preprocesses the data before applying the classifier. The steps are:
• Import the dataset.
• Convert nominal values to binary ones (in case of preprocessing).
• Normalize each variable of the data set so that the values range from 0 to 1 (in case of preprocessing).
• Create separate training and testing sets by randomly drawing out data for training and for testing.
• Pick and adapt the learning procedure.
• Perform the learning procedure.
• Calculate the performance of the model on the test set.
Figure 1. Proposed Hepatitis diagnosis model.
According to the work described in Section IV-A (data preprocessing), we now have 2 data sets with 155 instances each: the first is the data set in its root form (raw data), and the second has been preprocessed as described in Section IV-A (data transformation). To evaluate the proposed model, four experiments were implemented: two experiments for each data set, the first using single classifiers and the second using fusion.
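The accuracies reported below are obtained with 10-fold cross-validation, as stated in the abstract. A minimal index-level sketch of how such a split can be produced (no shuffling or class stratification, both of which WEKA applies by default):

```python
def k_fold_indices(n, k=10):
    """Split instance indices 0..n-1 into k roughly equal, disjoint folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k=10):
    """Yield (train_indices, test_indices) pairs: each fold tests once."""
    folds = k_fold_indices(n, k)
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

# For the 155-instance Hepatitis data set, each fold holds 15 or 16 instances.
splits = list(cross_validate(155, 10))
print(len(splits), len(splits[0][1]))  # prints 10 16
```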
V. EXPERIMENTAL RESULTS
A. Single classification task
Table 2 shows the comparison of accuracies over the single classifiers (SMO, RF, Simple Logistic, Logistic, and SVM).
Table 2. Single classifiers accuracy results (%)

Classifier      SMO      RF       Simple Logistic   Logistic   SVM
Raw data        85.16    84.52    83.88             82.58      79.35
Preprocessed    85.83    84.5     83.13             83.88      81.25
Raw data: SMO has the highest accuracy (85.16%), beating the related work cited in the introduction of this paper, without any preprocessing or feature selection [4-6], as shown in Figure 2.
Figure 2. Single classifiers accuracy (raw data)
Preprocessed data (nominal values converted to binary, data normalized): SMO is still the dominant classifier among the five selected, again achieving the highest accuracy (85.83%). The RF classifier comes second among the single classifiers for both data sets and is highlighted in cyan in Table 2. Shown in Figure 3.
Figure 3. Single classifiers accuracy (preprocessed data)
B. Multi-classifiers fusion task
Table 3 shows the comparison of accuracies over multi-classifier fusion.

Table 3. Multiple classifiers (fusion) highest accuracies (%)

# of classifiers   2       3       4       5
Raw data           85.17   85.81   86.45   87.01
Preprocessed       85.83   86.42   87.01   86.67

Raw data: We obtained the best accuracy (87.01%) at the fifth fusion level, combining all five classifiers. Preprocessed data (nominal values converted to binary, data normalized): We obtained the best accuracy (87.01%), the same value as for the raw data, but we reached it at the fourth fusion level, after which accuracy started to decrease.
Figure 4. Single and multi-classifiers accuracy (raw/preprocessed data)
VI. Conclusion and future work
The experimental results make clear that multi-classifier fusion achieved better accuracy than a single classifier for both data sets. We conclude the experiments as follows. The SMO classifier is ascertained to be a superior classifier to apply to the UCI Hepatitis data set, as it gives better accuracy than the commonly used SVM technique, even when SVM is applied with feature selection or the chi-square method. Handling the raw data with the same fused classifiers after converting nominal values to binary and normalizing the data reduces the fusion level by one, as we obtained the same accuracy using only four fused classifiers instead of five. We suggest future work in the area of eliminating noise, reducing unwanted values, and obtaining a data set with no missing values, in order to study the effect of these missing values in both the single and multi-classifier cases.
REFERENCES
[1] World Health Organization, http://www.who.int
[2] Tomasz Kanik. "Hepatitis disease diagnosis using Rough Set modification of the pre-processing algorithm." Information and Communication Technologies International Conference (2012).
[3] Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. "From data mining to knowledge discovery in databases." AI Magazine 17.3 (1996).
[4] A. H. Roslina and A. Noraziah. "Prediction of Hepatitis Prognosis Using Support Vector Machines and Wrapper Method." Seventh International Conference on Fuzzy Systems and Knowledge Discovery (2010).
[5] Varun Kumar M., Vijay Sharathi V., and Gayathri Devi B. R. "Hepatitis Prediction Model based on Data Mining Algorithm and Optimal Feature Selection to Improve Predictive Accuracy." International Journal of Computer Applications (0975-8887), Vol. 51, No. 19, August 2012.
[6] C. Barath Kumar, M. Varun Kumar, T. Gayathri, and S. Rajesh Kumar. "Analysis and Prediction of Hepatitis Using Support Vector Machine." International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014, 2235-2237.
[7] Jyotirmay Gadewadikar, Ognjen Kuljaca, Kwabena Agyepong, Erol Sarigul, Yufeng Zheng, and Ping Zhang. "Exploring Bayesian networks for medical decision support in breast cancer detection." African Journal of Mathematics and Computer Science Research, Vol. 3(10), pp. 225-231, October 2010.
[8] Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[9] Chen Y., Wang G., and Dong S. "Learning with progressive transductive support vector machine." Pattern Recognition Letters, Vol. 24, pp. 1845-1855.
[10] John C. Platt. "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines." Technical Report MSR-TR-98-14, April 1998.
[11] Mitchell, T. Machine Learning. McGraw Hill, 1997.
[12] John H. McDonald. "Handbook of Biological Statistics," at http://www.strath.ac.uk
[13] Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32.
[14] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi. Discovering Data Mining: From Concept to Implementation. Upper Saddle River, N.J.: Prentice Hall, 1998.