Machine learning algorithms play a vital role in the prediction of many diseases, such as heart disease, diabetes, cancer and lung disease. Applying machine learning algorithms to the healthcare domain relieves the burden on physicians, because it is impractical to manually scan all the data collected over a period of time in order to arrive at valuable information. Machine learning algorithms learn patterns from a training dataset. Once an algorithm completes its learning with the training dataset, it can automatically predict the target output label of unseen data. In this work, the prediction of diabetes using machine learning algorithms has been taken up, and a conceptual architecture based on big data architecture is proposed.
2. A Conceptual Approach to Enhance Prediction of Diabetes using Alternate Feature Selection and
Big Data Based Architecture
https://iaeme.com/Home/journal/IJARET 777 editor@iaeme.com
can help in the early detection and prediction of the disease. The main cause of diabetes remains
unknown, yet scientists believe that both genetic factors and environmental lifestyle play a
major role in diabetes. Even though it’s incurable, it can be managed by treatment and
medication. Individuals with diabetes face a risk of developing many secondary health issues.
So, early detection and treatment of diabetes becomes more important.
From an investigation of the related literature, the authors found that, although many research
works have addressed the prediction of diabetes, the accuracy of the reported methods varies.
Achieving the desired accuracy in diabetes prediction is therefore still an open research issue.
Firstly, for any prediction problem, identifying relevant features is an important prerequisite,
because the selection of features influences the prediction accuracy. Secondly, the preprocessing
of data also plays a major role in influencing the accuracy of prediction. Thirdly, the
architecture in which the prediction is performed also influences the prediction. Keeping these
points in mind, a conceptual architecture has been proposed to improve the prediction accuracy
of diabetes. More specifically, the objective of the proposed research is to provide a
conceptual model for analyzing how the accuracy and the performance of the prediction can be
improved with respect to alternate feature selection/extraction techniques and classification
techniques in a scalable environment established using big data tools and techniques.
2. RELATED WORK
A significant amount of research is being carried out in the area of prediction of diabetes
using machine learning techniques [1-8]. Some research works have employed deep learning
techniques for the prediction of diabetes [9-12]. Beyond the choice of machine learning or deep
learning model, other aspects, namely the preprocessing of data, feature selection/extraction
and the technologies used for data analysis, also influence the prediction process.
One important investigation is the identification of a formula for computing the Diabetes
Pedigree Function [13]. That work provides a way to compute the diabetes pedigree function,
which depends on whether the ancestors of a particular person are affected by diabetes. In this
way, it tries to include genetic information about the disease. This work becomes important
when one validates, with real data, a machine learning model that was first constructed using
a benchmark dataset.
3. PROPOSED APPROACH
A preliminary study of the existing literature shows that the accuracy and performance of
diabetes prediction still need to be improved. The present work therefore aims to enhance the
prediction of diabetes with efficient preprocessing techniques, feature selection/extraction
techniques and classification of the disease using machine/deep learning techniques in a
scalable environment established using big data technologies such as Apache Spark. The
higher-level conceptual schematic of the proposed work is given in Fig. 1.
4. DATA COLLECTION
It is proposed to use a dataset from the National Institute of Diabetes and Digestive and
Kidney Diseases, which is publicly available from the UCI ML Repository [14]. The Pima Indian
Diabetes (PID) dataset has 9 attributes (8 predictor attributes and 1 class attribute) and 768
records describing female patients, of which 500 (65.1%) are negative instances and 268 (34.9%)
are positive instances. A detailed description of all attributes is given in Table 1.
3. Chellammal Surianarayanan, Sharmila Rengasamy
4.1. Data Preprocessing
In this step, the collected data is preprocessed to handle incomplete or missing values. In
addition, other transformations, such as the conversion of continuous data into categorical
data as required by some classification algorithms, are also performed.
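As a minimal sketch of the two preprocessing steps described above, the following Python snippet replaces missing values with the median of the observed values and then converts a continuous attribute into categories. It assumes that a zero in a physiological attribute such as glucose marks a missing measurement. The function names, the bin edges and the toy values are illustrative assumptions, not settings taken from this work.

```python
from statistics import median


def impute_missing(values, missing_marker=0):
    """Replace entries equal to missing_marker with the median of the observed entries."""
    observed = [v for v in values if v != missing_marker]
    fill = median(observed)
    return [fill if v == missing_marker else v for v in values]


def discretize(values, edges, labels):
    """Map each value to the label of the first bin whose upper edge it does not exceed.

    Values above the last edge receive the last label, so len(labels) == len(edges) + 1.
    """
    result = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                result.append(label)
                break
        else:
            result.append(labels[-1])
    return result


# Toy glucose readings (mg/dl); the zeros stand for missing measurements (assumption).
glucose = [148, 85, 0, 89, 137, 0, 78]
glucose_filled = impute_missing(glucose)

# Illustrative bin edges only; they are not clinical thresholds from the paper.
glucose_levels = discretize(glucose_filled, edges=[99, 125], labels=["low", "medium", "high"])
print(glucose_filled)
print(glucose_levels)
```

Median imputation is only one possible choice here; mean imputation or removing incomplete records would fit the same step.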
4.2. Feature Selection/Extraction
Fundamentally, any classification algorithm assigns an incoming record, which consists of a set
of attributes, to one of the predefined class labels. For example, for a given set of
attributes, we need to classify whether the person has diabetes or not. The attributes are the
essential aspects or factors used to classify an instance into one of the two classes.
Figure 1 Higher level schematic of the proposed approach
There are two major families of feature selection techniques, namely filter methods and wrapper
methods. It is proposed to employ different feature selection and extraction techniques such as
correlation-based feature selection, the gain ratio method, info gain, PCA and the Relief
method. Correlation is a measure of the linear relationship between each attribute and the
class label. From the value of the correlation measure, one can find the relevant, that is
highly correlated, attributes. The gain ratio method helps in reducing the bias that arises
when multivalued attributes are involved in a prediction problem. Info gain computes the
reduction in entropy of the target label that results from knowing the value of an attribute.
The Relief algorithm uses instance-based learning and evaluates the quality of an attribute by
analysing how well it separates instances whose values are near to each other.
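The filter measures named above can be sketched in a few lines of Python. The snippet below computes the Pearson correlation between a numeric attribute and the class label, and the information gain and gain ratio of a categorical attribute. The toy values and the attribute names `glucose_level` and `age_group` are hypothetical and serve only to illustrate the calculations.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a categorical feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(feature, labels):
    """Information gain divided by the entropy of the feature's own value distribution."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 1 = diabetic, 0 = non-diabetic.
outcome = [1, 0, 1, 0, 1, 0, 1, 0]
glucose = [150, 90, 160, 95, 140, 85, 155, 100]
glucose_level = ["high", "low", "high", "low", "high", "low", "high", "low"]
age_group = ["young", "young", "old", "old", "young", "young", "old", "old"]

print("correlation(glucose, outcome):", round(pearson(glucose, outcome), 3))
print("info gain(glucose_level):", information_gain(glucose_level, outcome))
print("info gain(age_group):", information_gain(age_group, outcome))
print("gain ratio(glucose_level):", gain_ratio(glucose_level, outcome))
```

In this toy data `glucose_level` separates the classes completely and `age_group` carries no information about the label, so a filter method would rank the former first.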
Table 1 Dataset description
S.No | Attribute Name            | Attribute Description                                              | Mean ± S.D
1    | Pregnancies               | Number of times a woman got pregnant                               | 3.8 ± 3.3
2    | Glucose (mg/dl)           | Glucose concentration in oral glucose tolerance test for 120 min   | 120.8 ± 31.9
3    | Blood Pressure (mmHg)     | Diastolic Blood Pressure                                           | 69.1 ± 19.3
4    | Skin Thickness (mm)       | Fold Thickness of Skin                                             | 20.5 ± 15.9
5    | Insulin (mu U/mL)         | Serum Insulin for 2h                                               | 79.7 ± 115.2
6    | BMI (kg/m^2)              | Body Mass Index (weight/(height)^2)                                | 31.9 ± 7.8
7    | Diabetes Pedigree Function | Diabetes Pedigree Function                                        | 0.4 ± 0.3
8    | Age                       | Age (Years)                                                        | 33.2 ± 11.7
9    | Outcome                   | Class Variable (1 for positive, 0 for negative diabetes)           | -
4.3. Classification Techniques
After feature selection, it is proposed to use different machine learning algorithms such as
Naïve Bayes, Random Forest and Support Vector Machine. Naïve Bayes is a simple
probability-based classifier that supports multiple classes. A Random Forest classifier is a
combination of many decision tree classifiers and employs a voting mechanism to finalize the
class label of a given record. It is a fast algorithm and is capable of handling large
datasets. A Support Vector Machine handles non-linearity in data by using kernel functions to
map the data implicitly into a higher-dimensional space.
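To make the probability-based idea behind Naïve Bayes concrete, the following is a minimal sketch of a Gaussian Naïve Bayes classifier in plain Python. It assumes numeric attributes that are modelled with one normal distribution per class and attribute. The function names and the toy (glucose, BMI) records are hypothetical, and the sketch is not the implementation intended for the proposed architecture.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and the per-attribute mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for a single record."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # Log of the normal density, added under the naive independence assumption.
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training records: (glucose, BMI) with 1 = diabetic, 0 = non-diabetic.
train_rows = [(150, 35.0), (160, 38.0), (145, 33.0), (90, 24.0), (100, 26.0), (95, 22.0)]
train_labels = [1, 1, 1, 0, 0, 0]

nb_model = fit_gaussian_nb(train_rows, train_labels)
print(predict_gaussian_nb(nb_model, (155, 36.0)))
print(predict_gaussian_nb(nb_model, (92, 23.0)))
```

Working with log probabilities instead of multiplying raw densities avoids numerical underflow when many attributes are involved.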
4.4. Big Data based Architecture
It is proposed to carry out the analysis in an architecture that is capable of providing
scalability. The prediction will therefore be performed in a big data architecture established
using Apache Spark. Apache Spark [15-16] is a platform used to process big data in a
distributed fashion. One of the important features of the Spark platform is its capability to
keep data in memory rather than reading it from hard drives at each step, as in Hadoop
MapReduce. This feature provides fast performance. In addition, the Spark architecture is
fundamentally a distributed architecture and provides the required scalability. When deep
learning algorithms are employed, the volume of data involved in the analysis is typically
larger, and the big data based architecture facilitates efficient prediction.
The Spark platform consists of various components, namely Spark Core with its Resilient
Distributed Dataset (RDD) abstraction, Spark SQL for structured data, the machine learning
library (MLlib), the streaming data API and the graph processing API (GraphX), as shown in
Fig. 2.
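As a rough sketch of how the Spark SQL and machine learning components could be combined for this task, the following PySpark function loads a CSV file, assembles the attributes into a feature vector, trains a random forest and reports accuracy on a held-out split. Several details are assumptions: PySpark must be installed, the file path passed in is hypothetical, the column names follow one commonly distributed CSV version of the PID dataset, and the 80/20 split and 50 trees are illustrative values rather than settings from this work.

```python
# Column names are an assumption about the CSV header, not something fixed by the paper.
FEATURE_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
LABEL_COLUMN = "Outcome"


def train_on_spark(csv_path, seed=42):
    """Train a random forest with Spark MLlib and return the accuracy on a held-out split."""
    # Imported inside the function so this module loads even where PySpark is absent.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("diabetes-prediction-sketch").getOrCreate()
    try:
        # Spark SQL reads the file into a distributed DataFrame.
        data = spark.read.csv(csv_path, header=True, inferSchema=True)
        train, test = data.randomSplit([0.8, 0.2], seed=seed)

        # MLlib expects all predictor attributes in a single vector column.
        assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
        forest = RandomForestClassifier(
            labelCol=LABEL_COLUMN, featuresCol="features", numTrees=50)
        model = Pipeline(stages=[assembler, forest]).fit(train)

        evaluator = MulticlassClassificationEvaluator(
            labelCol=LABEL_COLUMN, predictionCol="prediction", metricName="accuracy")
        return evaluator.evaluate(model.transform(test))
    finally:
        spark.stop()
```

A call such as `train_on_spark("diabetes.csv")` would run the whole pipeline, where `diabetes.csv` is a placeholder for wherever the dataset is stored. The same code runs unchanged on a single machine or on a cluster, which is the scalability property the proposed architecture relies on.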
Figure 2 Components of Spark
5. EVALUATION
The performance of the algorithms will be evaluated using different measures such as accuracy,
computation time and scalability. Further, the classification performance of the algorithms
will be calculated using a confusion matrix. A confusion matrix is represented using four
counts, namely True Positive (TP), False Positive (FP), False Negative (FN) and True Negative
(TN), which are defined as follows.
• True Positive (TP): the actual value is true and the predicted value is true.
• False Positive (FP): the actual value is false and the predicted value is true.
• False Negative (FN): the actual value is true and the predicted value is false.
• True Negative (TN): the actual value is false and the predicted value is false.
A sample confusion matrix is given in Fig. 3.
Figure 3 Confusion matrix
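The four definitions above translate directly into a counting routine. The following sketch tallies TP, FP, FN and TN from paired actual and predicted labels. The label lists are made-up illustrative values, and the function name is hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary problem from paired label sequences."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is also positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: an error if the actual label is positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Illustrative labels only: 1 = diabetic, 0 = non-diabetic.
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 1, 0, 1, 0]
matrix = confusion_counts(actual_labels, predicted_labels)
print(matrix)
```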
Further, the mathematical formulae for the evaluation measures, namely accuracy, precision,
recall and F-score, are given in the following equations.
(i) Accuracy
Accuracy is calculated as the number of all correct predictions divided by the total number of
samples. Accuracy can be calculated with the following formula,
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \]
(ii) Precision
Precision is calculated as the number of correct positive predictions divided by the total number
of positive predictions. The formula for calculating precision is given below:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
(iii) Recall
Recall is calculated as the number of correct positive predictions divided by the total number
of positives. Recall can be calculated as,
\[ \text{Recall} = \frac{TP}{TP + FN} \]
(iv) F – score
F – score is the harmonic mean of recall and precision, which combines both in a single
measure. It can be calculated as,
\[ \text{F-score} = 2 \cdot \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \]
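The four measures can be computed from the confusion matrix counts as in the sketch below. Returning 0.0 when a denominator is zero is a convention assumed here, not one stated in this work. The second example applies the formulas to a trivial classifier that labels every record of the PID dataset as negative, using the class counts given in Section 4 (500 negative, 268 positive).

```python
def evaluation_measures(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-score from confusion matrix counts."""
    total = tp + tn + fn + fp
    accuracy = (tp + tn) / total
    # Assumed convention: a measure with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if recall + precision:
        f_score = 2 * recall * precision / (recall + precision)
    else:
        f_score = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Illustrative counts for a small test set.
example = evaluation_measures(tp=3, fp=1, fn=1, tn=3)

# Always predicting "negative" on the PID class distribution.
baseline = evaluation_measures(tp=0, fp=0, fn=268, tn=500)

print(example)
print(baseline)
```

The baseline reaches an accuracy of about 65.1% while its recall is zero, which is why precision, recall and F-score are reported alongside accuracy on a dataset with unequal class sizes.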
6. CONCLUSION
A conceptual model has been proposed for the prediction of diabetes using machine learning
algorithms, with the intention of improving the prediction accuracy in a big data based
architecture. In addition to the scalable architecture, it is proposed to use different feature
selection techniques for finding relevant attributes. The proposed architecture will be
implemented and analysed with the benchmark dataset mentioned above. After that, the
architecture will be validated with real data collected from individuals. Elaborate
experimentation will be performed with many datasets and a framework will be developed.
REFERENCES
[1] Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia
Comput. Sci. 2018, 132, 1578–1585.
[2] Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and
Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–
2528.
[3] Negi, A.; Jaiswal, V. A first attempt to develop a diabetes prediction method based on different
global datasets. In Proceedings of the 2016 Fourth International Conference on Parallel,
Distributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016; pp. 237–
241.
[4] Soltani, Z.; Jafarian, A. A New Artificial Neural Networks Approach for Diagnosing Diabetes
Disease Type II. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 89–94.
[5] Somnath, R.; Suvojit, M.; Sanket, B.; Riyanka, K.; Priti, G.; Sayantan, M.; Subhas, B. Prediction
of Diabetes Type-II Using a Two-Class Neural Network. In Proceedings of the 2017
International Conference on Computational Intelligence, Communications, and Business
Analytics, Kolkata, India, 24–25 March 2017; pp. 65–71.
[6] Mamuda, M.; Sathasivam, S. Predicting the survival of diabetes using neural network. In
Proceedings of the AIP Conference Proceedings, Bydgoszcz, Poland, 9–11 May 2017; Volume
1870, pp. 40–46.
[7] Malik, S.; Khadgawat, R.; Anand, S.; Gupta, S. Non-invasive detection of fasting blood glucose
level via electrochemical measurement of saliva. SpringerPlus 2016, 5, 701.
[8] Perveen, S.; Shahbaz, M.; Guergachi, A.; Keshavjee, K. Performance Analysis of Data Mining
Classification Techniques to Predict Diabetes. Procedia Comput. Sci. 2016, 82, 115–121.
[9] Mohebbi, A.; Aradóttir, T.B.; Johansen, A.R.; Bengtsson, H.; Fraccaro, M.; Mørup, M. A deep
learning approach to adherence detection for type 2 diabetics. In Proceedings of the 2017 39th
Annual International Conference of the IEEE Engineering in Medicine and Biology Society
(EMBC), Jeju, Korea, 11–15 July 2017; pp. 2896–2899.
[10] Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep Patient: An Unsupervised Representation to
Predict the Future of Patients from the Electronic Health Records. Sci. Rep. 2016, 6, 26094.
[11] Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Predicting healthcare trajectories from medical
records: A deep learning approach. J. Biomed. Inform. 2017, 69, 218–229.
[12] Lekha, S.; Suchetha, M. Real-Time Non-Invasive Detection and Classification of Diabetes
Using Modified Convolution Neural Network. IEEE J. Biomed. Health Inform. 2018, 22, 1630–
1636.
[13] Smith, Jack, Everhart, J, Dickson, W, Johannes, Richard, “Using the ADAP Learning
Algorithm to Forecast the Onset of Diabetes Mellitus”, Proceedings of Annual Symposium on
Computer Applications in Medical Care, November 1988, from PubMed Central.
[14] M. Lichman, "Pima Indians diabetes database," ed. Center for machine learning and intelligent
systems.: UCI Machine Learning repository.
[15] Abdul Ghaffar Shoro and Tariq Rahim Soomro, “Big Data Analysis: Apache Spark
Perspective”, Global Journal of Computer Science and Technology: Computer Software & Data
Engineering, Publisher: Global Journal Inc., USA, e-ISSN: 0975-4172, p-ISSN: 0975-4350,
Vol. 15, Issue 1, pp. 1-10, January 2015.
[16] Soumya Ounacer, Mohamed Amine Talhaoui, Soufiane Ardchir, Abderrahmane Daif and
Mohamed Azouazi, “A New Architecture for Real Time Data Stream Processing”, International
Journal of Advanced Computer Science and Applications (IJACSA), Vol. 8, Issue 11, pp. 44-
51, 2017.