Machine learning algorithms play a vital role in predicting many diseases, such as heart disease, diabetes, cancer, and lung disease. Applying machine learning to the healthcare domain relieves the burden on physicians, as it is impractical to manually scan all the data collected over a period of time in order to arrive at valuable information. Machine learning algorithms learn from a training dataset and can then make decisions in a way that resembles human judgment. Once an algorithm completes its learning with the training dataset, it can automatically predict the target output label of unseen data. This work takes up the prediction of diabetes using machine learning algorithms and proposes a conceptual architecture based on big data architecture.
2. A Conceptual Approach to Enhance Prediction of Diabetes using Alternate Feature Selection and
Big Data Based Architecture
https://iaeme.com/Home/journal/IJARET 777 editor@iaeme.com
can help in the early detection and prediction of the disease. The main cause of diabetes remains
unknown, yet scientists believe that both genetic factors and environmental and lifestyle factors
play a major role in diabetes. Even though it is incurable, it can be managed by treatment and
medication. Individuals with diabetes face a risk of developing many secondary health issues,
so early detection and treatment of diabetes become all the more important.
From the investigation of related literature, the authors found that, although many research
works have addressed the prediction of diabetes, the accuracy of the methods varies. In
addition, achieving the desired accuracy of diabetes prediction is still an open research issue.
Firstly, for any prediction problem, identifying relevant features is an important prerequisite,
as the selection of features influences the prediction accuracy. Secondly, preprocessing of data
also plays a major role in influencing the accuracy of prediction. Thirdly, the architecture in
which the prediction is performed also influences the prediction. Keeping the above points in
mind, a conceptual architecture has been proposed to improve the prediction accuracy of
diabetes. More specifically, the objective of the proposed research is to provide a conceptual
model for analyzing how the accuracy and performance of the prediction can be improved with
alternate feature selection/extraction techniques and classification techniques in a scalable
environment established using big data tools and techniques.
2. RELATED WORK
A significant amount of research is being carried out in the area of prediction of diabetes
using machine learning techniques [1-8]. Some research works have employed deep learning
techniques for the prediction of diabetes [9-12]. Beyond the choice among machine learning and
deep learning models, other aspects, namely preprocessing of data, feature selection/extraction,
and the technologies used for data analysis, also influence the prediction process.
One important investigation is the identification of a formula for computing the Diabetes
Pedigree Function [13]. That research work provides a way to compute the diabetes pedigree
function, which depends on whether the ancestors of a particular person are affected by
diabetes. In this way, it tries to include genetic information about the disease. This work
becomes important when one validates a machine learning model that was first constructed
using a benchmark dataset.
3. PROPOSED APPROACH
A preliminary study of the existing literature found that the accuracy and performance of
diabetes prediction still need to be improved. The present work therefore aims to enhance the
prediction of diabetes with efficient preprocessing techniques, feature selection/extraction
techniques, and classification of the disease using machine/deep learning techniques in a
scalable environment established using big data technologies such as Apache Spark. The
higher-level conceptual schematic of the proposed work is given in Fig. 1.
4. DATA COLLECTION
It is proposed to use the dataset from the National Institute of Diabetes and Digestive and
Kidney Diseases (publicly available at the UCI ML Repository [14]). The Pima Indian Diabetes
(PID) dataset (which is available in the reference 34) has 9 attributes (8 predictor attributes
plus 1 class attribute) and 768 records describing female patients, of which 500 are negative
instances (65.1%) and 268 are positive instances (34.9%). The detailed description of all
attributes is given in Table 1.
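The class proportions quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable and function names are illustrative assumptions. It also derives the accuracy of a trivial majority-class predictor, which is a useful floor when comparing classifiers on this imbalanced dataset.

```python
# Class counts as reported in the text for the Pima Indian Diabetes dataset.
n_records = 768
n_negative = 500
n_positive = 268

def class_share(count, total):
    """Return the percentage share of one class, rounded to one decimal."""
    return round(100.0 * count / total, 1)

negative_share = class_share(n_negative, n_records)   # share of non-diabetic records
positive_share = class_share(n_positive, n_records)   # share of diabetic records

# A model that always predicts the majority class already reaches this accuracy,
# so reported accuracies are only meaningful relative to this floor.
majority_baseline = max(n_negative, n_positive) / n_records

print(negative_share, positive_share, round(majority_baseline, 3))
```

Because roughly two thirds of the records are negative, accuracy alone can look favourable for a weak model; measures such as sensitivity and specificity are worth reporting alongside it.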
3. Chellammal Surianarayanan, Sharmila Rengasamy
4.1. Data Preprocessing
In this step, the collected data is preprocessed to handle incomplete or missing values. In
addition, other transformations, such as the conversion of continuous data into categorical
data as required by the classification algorithms, are also performed.
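The two preprocessing steps named above can be sketched in plain Python. This is only an illustration under stated assumptions: missing readings are represented as `None`, they are filled with the mean of the observed values, and the BMI bin edges and labels are hypothetical. None of these choices is prescribed by the present work.

```python
from statistics import mean

def impute_missing(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def discretize(value, edges, labels):
    """Map a continuous value to a category using ascending bin edges."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is at or above the last edge

# Hypothetical BMI column with two missing readings.
bmi = [33.6, None, 23.3, 28.1, None, 25.6]
bmi_filled = impute_missing(bmi)

# Hypothetical bin edges in kg/m^2 (illustrative, not taken from the paper).
BMI_EDGES = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]
bmi_categories = [discretize(v, BMI_EDGES, BMI_LABELS) for v in bmi_filled]
```

Mean imputation is used here only because it is the simplest option; median imputation or dropping incomplete records would fit the same structure.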
4.2. Feature Selection/Extraction
Fundamentally, any classification algorithm assigns an incoming raw data instance, which
consists of a set of attributes, to one of the predefined class labels. For example, for a
given set of attributes, we need to classify whether the person has diabetes or not. So, each
data instance needs to be assigned to a predefined class based on its attributes. The
attributes are the essential aspects or factors used to classify an instance into one of two
classes, for example, whether a person has diabetes or not.
Figure 1 Higher level schematic of the proposed approach
4. A Conceptual Approach to Enhance Prediction of Diabetes using Alternate Feature Selection and
Big Data Based Architecture
There are two major families of feature selection techniques, namely filter methods and
wrapper methods. It is proposed to employ different feature selection techniques such as
correlation-based feature selection, the gain ratio method, info gain, PCA and the Relief
method. Correlation is a measure of the linear relationship between each attribute and the
class label. From the value of the correlation measure, one can find the relevant, highly
correlated attributes. The gain ratio method helps in reducing bias when multivalued
attributes are involved in a prediction problem. Info gain computes the reduction in entropy
and thereby determines the information gain of each attribute in the context of the target
label. The Relief algorithm uses instance-based learning and evaluates the quality of an
attribute by analysing how well the attribute separates instances that lie near to each
other.
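Three of the filter measures described above (info gain, gain ratio and correlation) can be sketched in plain Python as follows. The toy attribute values are invented for illustration and are not drawn from the PID dataset; the function names are likewise hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def info_gain(feature, labels):
    """Reduction in class entropy after splitting on a categorical feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(feature, labels):
    """Info gain normalised by the entropy of the feature's own value split."""
    split_info = entropy(feature)
    return info_gain(feature, labels) / split_info if split_info else 0.0

def pearson(xs, ys):
    """Pearson correlation between a numeric attribute and a 0/1 class label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data (invented): a discretized glucose level and the diabetes outcome.
glucose_level = ["high", "high", "low", "low", "high", "low"]
outcome = [1, 1, 0, 0, 1, 0]
gain = info_gain(glucose_level, outcome)
ratio = gain_ratio(glucose_level, outcome)

# Toy numeric readings (invented) for the correlation measure.
glucose = [148, 183, 85, 89, 137, 116]
corr = pearson(glucose, outcome)
```

In this toy case the glucose level separates the two classes perfectly, so the info gain equals the full class entropy of 1 bit. Attributes would then be ranked by such scores and the top-ranked ones retained.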
Table 1 Dataset description
S.No | Attribute Name | Attribute Description | Mean ± S.D.
1 | Pregnancies | Number of times a woman got pregnant | 3.8 ± 3.3
2 | Glucose (mg/dl) | Glucose concentration in oral glucose tolerance test for 120 min | 120.8 ± 31.9
3 | Blood Pressure (mmHg) | Diastolic Blood Pressure | 69.1 ± 19.3
4 | Skin Thickness (mm) | Fold Thickness of Skin | 20.5 ± 15.9
5 | Insulin (mu U/mL) | Serum Insulin for 2 h | 79.7 ± 115.2
6 | BMI (kg/m^2) | Body Mass Index (weight/(height)^2) | 31.9 ± 7.8
7 | Diabetes Pedigree Function | Diabetes Pedigree Function | 0.4 ± 0.3
8 | Age | Age (Years) | 33.2 ± 11.7
9 | Outcome | Class Variable (class value 1 for positive, 0 for negative diabetes) |
4.3. Classification Techniques
After feature selection, it is proposed to use different machine learning algorithms such as
Naïve Bayes, Random Forest and Support Vector Machine. The Naïve Bayes algorithm is a simple
multi-class, probability-based classifier. The Random Forest classifier is a combination of
many decision tree classifiers and employs a voting mechanism to finalize the class label of
a given data instance. It is a fast algorithm and is capable of handling huge data sizes.
Support Vector Machine handles non-linearity in data by using kernel functions to transform
the data into a higher-dimensional space.
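The voting mechanism of a Random Forest can be illustrated with a minimal sketch. The three "trees" below are hypothetical single-threshold rules with invented cut-offs; a real Random Forest learns full decision trees from bootstrap samples of the training data, which this sketch does not attempt.

```python
from collections import Counter

# Three hypothetical "trees", each reduced to one threshold rule.
# The thresholds are invented for illustration, not clinical cut-offs.
def tree_glucose(patient):
    return 1 if patient["glucose"] >= 140 else 0

def tree_bmi(patient):
    return 1 if patient["bmi"] >= 30 else 0

def tree_age(patient):
    return 1 if patient["age"] >= 45 else 0

def forest_predict(patient, trees):
    """Return the class label that receives the most votes from the trees."""
    votes = Counter(tree(patient) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_glucose, tree_bmi, tree_age]
patient = {"glucose": 150, "bmi": 33.0, "age": 31}
label = forest_predict(patient, trees)  # two of the three trees vote 1
```

Using an odd number of voters, as here, avoids ties in a two-class problem.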
4.4. Big Data based Architecture
It is proposed to carry out the analysis in an architecture that is capable of providing
scalability. So, it is proposed to perform the prediction in a big data architecture
established using Apache Spark. Apache Spark [15-16] is a platform used to process
big data in a distributed fashion. One of the important features of the Spark platform is its
capability of handling data by keeping it in memory rather than bringing the data from
hard drives as in Hadoop. This feature provides fast performance. In addition, the Spark
architecture is fundamentally a distributed architecture and provides the required scalability.
When deep learning algorithms are employed, the volume of data involved in the analysis
is expected to be larger, and the big data based architecture facilitates an efficient prediction.
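The idea of processing in-memory partitions in parallel and then combining the partial results can be sketched in plain Python. This is a conceptual stand-in only: it uses a local thread pool, not Spark, and no Spark API appears in it. The helper names and the choice of four partitions are assumptions made for illustration; the class counts are those reported for the PID dataset.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(records, n_parts):
    """Split records into n_parts roughly equal in-memory chunks."""
    return [records[i::n_parts] for i in range(n_parts)]

def count_classes(chunk):
    """Per-partition work: count negative (0) and positive (1) outcomes."""
    positives = sum(chunk)
    return (len(chunk) - positives, positives)

def merge(a, b):
    """Combine the partial counts of two partitions."""
    return (a[0] + b[0], a[1] + b[1])

# Outcome column with the class balance reported for the PID dataset.
outcomes = [0] * 500 + [1] * 268

parts = partition(outcomes, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(count_classes, parts))

negatives, positives = reduce(merge, partials)
```

In a distributed platform the partitions would reside on different worker nodes, so adding nodes allows larger datasets to be handled with the same per-partition logic.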
The Spark platform consists of various components such as Spark Core with its Resilient
Distributed Dataset (RDD) abstraction, Spark SQL for structured data, the Machine Learning
library, the Streaming Data API and the graph processing API (GraphX), as shown in Fig. 2.
Figure 2 Components of Spark
5. EVALUATION
The performance of the algorithms will be evaluated using different measures such as accuracy,
computation time and scalability. Further, the classification performance will be calculated
using a confusion matrix. A confusion matrix is represented using four counts, namely True
Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN), which are
defined as follows.
• True Positive (TP) means that the actual value is true and the predicted value is true.
• False Positive (FP) means that the actual value is false and the predicted value is true.
• False Negative (FN) means that the actual value is true and the predicted value is false.
• True Negative (TN) means that the actual value is false and the predicted value is false.
A sample confusion matrix is given in Fig. 3.
Figure 3 Confusion matrix
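The four counts defined above can be obtained from paired actual and predicted labels, as in the following minimal sketch. The label lists are invented for illustration, and the convention that class value 1 is the positive class follows Table 1.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actual true, predicted true
        elif a != positive and p == positive:
            fp += 1  # actual false, predicted true
        elif a == positive and p != positive:
            fn += 1  # actual true, predicted false
        else:
            tn += 1  # actual false, predicted false
    return tp, fp, fn, tn

# Invented labels for eight test instances.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
```

For these invented labels the counts are TP = 3, FP = 1, FN = 1 and TN = 3, which sum to the eight test instances.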
Further, the mathematical formulae for the evaluation measures, namely accuracy,
precision, recall and F-score, are given through the following equations.
(i) Accuracy
Accuracy is calculated as the number of all correct predictions divided by the total number of
samples. Accuracy can be calculated with the following formula:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$$
(ii) Precision
Precision is calculated as the number of correct positive predictions divided by the total number
of positive predictions. The formula for calculating precision is given below:
$$\text{Precision} = \frac{TP}{TP + FP}$$
(iii) Recall
Recall is calculated as the number of correct positive predictions divided by the total number
of positives. Recall can be calculated as,
$$\text{Recall} = \frac{TP}{TP + FN}$$
(iv) F – score
The F-score is the harmonic mean of recall and precision, which measures both at the same
time. It can be calculated as,
$$\text{F-score} = 2 \cdot \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
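The four evaluation measures can be computed directly from the confusion matrix counts, as in the minimal sketch below. The counts passed in are invented for illustration, and returning 0.0 when a denominator is zero is an assumed convention that the paper does not specify.

```python
def evaluation_measures(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    # Assumed convention: a measure with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = recall + precision
    f_score = 2 * recall * precision / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }

# Invented counts for 100 test instances.
measures = evaluation_measures(tp=40, fp=10, fn=20, tn=30)
```

With these invented counts, accuracy is 70/100 = 0.7, precision is 40/50 = 0.8, recall is 40/60 ≈ 0.667 and the F-score is 8/11 ≈ 0.727. The F-score lies below the accuracy here because it ignores the true negatives and is pulled down by the lower recall.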
6. CONCLUSION
A conceptual model has been proposed for the prediction of diabetes using machine learning
algorithms, with the intention of improving the prediction accuracy in a big data based
architecture. In addition to the scalable architecture, it is proposed to use different
feature selection techniques for finding relevant attributes. The proposed architecture will
be implemented and analysed with the mentioned benchmark dataset. After that, the
architecture will be validated with real data collected from individuals. Elaborate
experimentation will be performed with many datasets and a framework will be developed.
REFERENCES
[1] Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia
Comput. Sci. 2018, 132, 1578–1585.
[2] Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and
Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–
2528.
[3] Negi, A.; Jaiswal, V. A first attempt to develop a diabetes prediction method based on different
global datasets. In Proceedings of the 2016 Fourth International Conference on Parallel,
Distributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016; pp. 237–
241.
[4] Soltani, Z.; Jafarian, A. A New Artificial Neural Networks Approach for Diagnosing Diabetes
Disease Type II. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 89–94.
[5] Somnath, R.; Suvojit, M.; Sanket, B.; Riyanka, K.; Priti, G.; Sayantan, M.; Subhas, B. Prediction
of Diabetes Type-II Using a Two-Class Neural Network. In Proceedings of the 2017
International Conference on Computational Intelligence, Communications, and Business
Analytics, Kolkata, India, 24–25 March 2017; pp. 65–71.
[6] Mamuda, M.; Sathasivam, S. Predicting the survival of diabetes using neural network. In
Proceedings of the AIP Conference Proceedings, Bydgoszcz, Poland, 9–11 May 2017; Volume
1870, pp. 40–46.
[7] Malik, S.; Khadgawat, R.; Anand, S.; Gupta, S. Non-invasive detection of fasting blood glucose
level via electrochemical measurement of saliva. SpringerPlus 2016, 5, 701.
[8] Perveen, S.; Shahbaz, M.; Guergachi, A.; Keshavjee, K. Performance Analysis of Data Mining
Classification Techniques to Predict Diabetes. Procedia Comput. Sci. 2016, 82, 115–121.
[9] Mohebbi, A.; Aradóttir, T.B.; Johansen, A.R.; Bengtsson, H.; Fraccaro, M.; Mørup, M. A deep
learning approach to adherence detection for type 2 diabetics. In Proceedings of the 2017 39th
Annual International Conference of the IEEE Engineering in Medicine and Biology Society
(EMBC), Jeju, Korea, 11–15 July 2017; pp. 2896–2899.
[10] Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep Patient: An Unsupervised Representation to
Predict the Future of Patients from the Electronic Health Records. Sci. Rep. 2016, 6, 26094.
[11] Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Predicting healthcare trajectories from medical
records: A deep learning approach. J. Biomed. Inform. 2017, 69, 218–229.
[12] Lekha, S.; Suchetha, M. Real-Time Non-Invasive Detection and Classification of Diabetes
Using Modified Convolution Neural Network. IEEE J. Biomed. Health Inform. 2018, 22, 1630–
1636.
[13] Smith, Jack; Everhard, J.; Dickson, W.; Johannes, Richard. Using the ADAP Learning
Algorithm to Forecast the Onset of Diabetes Mellitus. In Proceedings of the Annual Symposium on
Computer Applications in Medical Care, November 1988; from PubMed Central.
[14] M. Lichman, "Pima Indians diabetes database," ed. Center for machine learning and intelligent
systems.: UCI Machine Learning repository.
[15] Abdul Ghaffar Shoro and Tariq Rahim Soomro, “Big Data Analysis: Apache Spark
Perspective”, Global Journal of Computer Science and Technology: Computer Software & Data
Engineering, Publisher: Global Journal Inc., USA, e-ISSN: 0975-4172, p-ISSN: 0975-4350,
Vol. 15, Issue 1, pp. 1-10, January 2015.
[16] Soumya Ounacer, Mohamed Amine Talhaoui, Soufiane Ardchir, Abderrahmane Daif and
Mohamed Azouazi, “A New Architecture for Real Time Data Stream Processing”, International
Journal of Advanced Computer Science and Applications (IJACSA), Vol. 8, Issue 11, pp. 44-
51, 2017.